Here's my code for the Postgres vector store:

from llama_index import VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.ingestion import IngestionPipeline, IngestionCache
from llama_index.ingestion.cache import RedisCache
from llama_index.vector_stores import PGVectorStore

pg_vector_store = PGVectorStore.from_params(
    database=config["Database"]["DB_NAME"],
    host=config["Database"]["DB_HOST"],
    password=config["Database"]["DB_PASSWORD"],
    port=config["Database"]["DB_PORT"],
    user=config["Database"]["DB_USER"],
    table_name=db_table_name,
    embed_dim=384,  # bge-small-v1.5 embedding dimension
    hybrid_search=True,
    text_search_config="english",
)

ingest_cache = IngestionCache(
    cache=RedisCache.from_host_and_port(host="127.0.0.1", port=6379),
    collection="my_test_cache",
)
pipeline = IngestionPipeline(
    transformations=[
        SimpleNodeParser(chunk_size=512, chunk_overlap=20),
        embed_model,
    ],
    vector_store=pg_vector_store,
    cache=ingest_cache,
)

pipeline.run(documents)

# build index
vector_index = VectorStoreIndex.from_vector_store(vector_store=pg_vector_store, show_progress=True)
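
Since hybrid_search=True is enabled above, the resulting index can be queried in hybrid mode, combining dense embedding search with Postgres full-text search. A minimal sketch; the query text and sparse_top_k value are illustrative, not from this thread:

query_engine = vector_index.as_query_engine(
    vector_store_query_mode="hybrid",  # use both dense and sparse retrieval
    sparse_top_k=2,  # number of text-search results to fetch (illustrative)
)
response = query_engine.query("example question about the ingested documents")
print(response)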
This makes more sense now that you've attached the vector store.
Thanks @Logan M ! I found the issue. Everything else was correct, I just had to change pipeline.run(documents) to pipeline.run(documents=documents) πŸ˜„
I still have a couple of questions though
1) Ingestion cache: will it only save time if the input is exactly the same? I added an extra doc, reindexed, and it took the usual amount of time. Is that understanding accurate?
2) Initial indexing took about 170 secs, but subsequent indexing (with the exact same documents) took about 70 secs. Does that sound right? Is that how long it's supposed to take even after caching?
1. It hashes the combination of inputs + transformation at each step.

So adding a document would be a cache miss. If you added a transformation to the end of the pipeline instead, it would only run the new transform

2. It's still inserting nodes each time. If you are planning to rerun the same data and dedupe, check out the page I linked that introduces the docstore to the pipeline (a sketch of that setup is below).
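
A minimal sketch of attaching a docstore, assuming the legacy llama_index v0.9 layout used above (SimpleDocumentStore from llama_index.storage.docstore); inputs need stable doc IDs for dedup to work:

from llama_index.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[
        SimpleNodeParser(chunk_size=512, chunk_overlap=20),
        embed_model,
    ],
    # The docstore tracks input IDs and content hashes, so rerunning the
    # pipeline over unchanged documents skips re-embedding and re-inserting.
    docstore=SimpleDocumentStore(),
    vector_store=pg_vector_store,
    cache=ingest_cache,
)
pipeline.run(documents=documents)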
Thanks @Logan M ! I'll check it out
@Logan M , what's the difference between upserts and duplicates_only docstore_strategy? I couldn't find a clear explanation in the documentation
Upserts only works if you attach both a docstore and a vector store to the pipeline.

It uses the IDs of the inputs as an anchor and compares content hashes stored in the docstore. If the ID of an input matches something the pipeline has seen before, the old version is deleted and the new item is added (i.e. an upsert).

Duplicates only is where, if the hash of an input matches a hash already in the docstore, the input is skipped.
Hopefully that kind of made sense lol
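
A minimal sketch contrasting the two strategies, assuming DocstoreStrategy is importable from llama_index.ingestion as in the v0.9 API:

from llama_index.ingestion import DocstoreStrategy
from llama_index.storage.docstore import SimpleDocumentStore

# UPSERTS: requires both a docstore and a vector store. If an input's ID
# matches something seen before, the old nodes are deleted and the new
# version is inserted.
upsert_pipeline = IngestionPipeline(
    transformations=[SimpleNodeParser(chunk_size=512, chunk_overlap=20), embed_model],
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
    vector_store=pg_vector_store,
)

# DUPLICATES_ONLY: if an input's hash matches one already in the docstore,
# the input is skipped entirely; nothing is deleted or replaced.
dedup_pipeline = IngestionPipeline(
    transformations=[SimpleNodeParser(chunk_size=512, chunk_overlap=20), embed_model],
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.DUPLICATES_ONLY,
    vector_store=pg_vector_store,
)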
Thanks @Logan M ! Honestly, a little bit lol! I think I might have to just try both and understand what happens πŸ™‚