Here's my code for the Postgres vector store:

from llama_index import VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.ingestion import IngestionPipeline, IngestionCache
from llama_index.ingestion.cache import RedisCache
from llama_index.vector_stores import PGVectorStore

pg_vector_store = PGVectorStore.from_params(
    database=config["Database"]["DB_NAME"],
    host=config["Database"]["DB_HOST"],
    password=config["Database"]["DB_PASSWORD"],
    port=config["Database"]["DB_PORT"],
    user=config["Database"]["DB_USER"],
    table_name=db_table_name,
    embed_dim=384,  # bge-small-v1.5 embedding dimension
    hybrid_search=True,
    text_search_config="english",
)

ingest_cache = IngestionCache(
    cache=RedisCache.from_host_and_port(host="127.0.0.1", port=6379),
    collection="my_test_cache",
)
pipeline = IngestionPipeline(
    transformations=[
        SimpleNodeParser(chunk_size=512, chunk_overlap=20),
        embed_model,
    ],
    vector_store=pg_vector_store,
    cache=ingest_cache,
)

pipeline.run(documents)

# build index
vector_index = VectorStoreIndex.from_vector_store(vector_store=pg_vector_store, show_progress=True)
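
Since hybrid_search=True is enabled above, the resulting index can be queried in hybrid mode, combining dense embedding search with Postgres full-text search. A minimal sketch; the query text and sparse_top_k value are illustrative, not from this thread:

query_engine = vector_index.as_query_engine(
    vector_store_query_mode="hybrid",  # use both dense and sparse retrieval
    sparse_top_k=2,  # number of text-search results to fetch (illustrative)
)
response = query_engine.query("example question about the ingested documents")
print(response)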
This makes more sense now that you've attached the vector store.
Thanks @Logan M ! I found the issue. Everything else was correct, I just had to change pipeline.run(documents) to pipeline.run(documents=documents) πŸ˜„
I still have a couple of questions though
1) Ingestion cache: will it only save time if the input is exactly the same? I added an extra doc, reindexed, and it took the usual amount of time. Is that understanding accurate?
2) Initial indexing took about 170 secs, but subsequent indexing (with the exact same documents) took about 70 secs. Does that sound right? Is that how long it's supposed to take even after caching?
1. It hashes the combination of inputs + transformation at each step.

So adding a document would be a cache miss. If you added a transformation to the end of the pipeline instead, it would only run the new transform

2. It's still inserting nodes each time. If you are planning to rerun the same data and dedupe, check out the page I linked that introduces the docstore to the pipeline (a sketch of that setup is below).
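
A minimal sketch of attaching a docstore, assuming the legacy llama_index v0.9 layout used above (SimpleDocumentStore from llama_index.storage.docstore); inputs need stable doc IDs for dedup to work:

from llama_index.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[
        SimpleNodeParser(chunk_size=512, chunk_overlap=20),
        embed_model,
    ],
    # The docstore tracks input IDs and content hashes, so rerunning the
    # pipeline over unchanged documents skips re-embedding and re-inserting.
    docstore=SimpleDocumentStore(),
    vector_store=pg_vector_store,
    cache=ingest_cache,
)
pipeline.run(documents=documents)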
Thanks @Logan M ! I'll check it out
@Logan M , what's the difference between upserts and duplicates_only docstore_strategy? I couldn't find a clear explanation in the documentation
Upserts only works if you attach both a docstore and a vector store to the pipeline.

It uses the IDs of the inputs as an anchor and compares content hashes stored in the docstore. If the ID of an input matches something the pipeline has seen before, the old version is deleted and the new item is added (i.e. an upsert).

Duplicates only is where, if the hash of an input matches a hash already in the docstore, the input is skipped.
Hopefully that kind of made sense lol
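
A minimal sketch contrasting the two strategies, assuming DocstoreStrategy is importable from llama_index.ingestion as in the v0.9 API:

from llama_index.ingestion import DocstoreStrategy
from llama_index.storage.docstore import SimpleDocumentStore

# UPSERTS: requires both a docstore and a vector store. If an input's ID
# matches something seen before, the old nodes are deleted and the new
# version is inserted.
upsert_pipeline = IngestionPipeline(
    transformations=[SimpleNodeParser(chunk_size=512, chunk_overlap=20), embed_model],
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
    vector_store=pg_vector_store,
)

# DUPLICATES_ONLY: if an input's hash matches one already in the docstore,
# the input is skipped entirely; nothing is deleted or replaced.
dedup_pipeline = IngestionPipeline(
    transformations=[SimpleNodeParser(chunk_size=512, chunk_overlap=20), embed_model],
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.DUPLICATES_ONLY,
    vector_store=pg_vector_store,
)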
Thanks @Logan M ! Honestly, a little bit lol! I think I might have to just try both and understand what happens πŸ™‚