Updated 3 months ago

Folks, I am using PGVector as the vector store for LlamaIndex. However, when I add or refresh documents, a new row is created in the table. Why is that? I was under the impression that if the content and the IDs are the same, no new embeddings would be computed and no new rows would be created in the Postgres database.
Indeed, if the document IDs and content are the same, it won't be inserted. BUT this depends on you saving/loading a docstore to manage that logic.
I think the easiest approach, though, is to use the ingestion pipeline.
It makes it much clearer what's going on.
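To make the "depends on a docstore" point concrete, here is a minimal, library-free sketch of that bookkeeping. It is not LlamaIndex's implementation: `TinyDocstore`, `upsert`, and the `rows` list standing in for the Postgres table are all hypothetical names used only for illustration.

```python
import hashlib


class TinyDocstore:
    """Hypothetical stand-in for a docstore: one content hash per document ID."""

    def __init__(self):
        self.hashes = {}  # doc_id -> sha256 hex digest of the last ingested text

    def upsert(self, doc_id, text, rows):
        """Write (doc_id, text) into rows unless it is already known and unchanged."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        previous = self.hashes.get(doc_id)
        if previous == digest:
            return "skipped"  # same ID, same content: nothing to embed or insert
        if previous is not None:
            # Known ID with changed content: drop the stale rows first.
            rows[:] = [row for row in rows if row[0] != doc_id]
        rows.append((doc_id, text))
        self.hashes[doc_id] = digest
        return "inserted" if previous is None else "updated"


rows = []  # stands in for the vector store table
store = TinyDocstore()
results = [
    store.upsert("doc-1", "hello", rows),
    store.upsert("doc-1", "hello", rows),
    store.upsert("doc-1", "hello world", rows),
]

# A fresh docstore (for example one that was never saved and reloaded) has no
# memory of doc-1, so the same document is written a second time.
fresh_store = TinyDocstore()
results.append(fresh_store.upsert("doc-1", "hello world", rows))
```

The last call is the situation described in the question: the table already holds the row, but the bookkeeping that would recognise it is gone, so a duplicate appears.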
Isn't storing to the docstore handled automatically if we provide the storage context and service context?
Are there any examples of index creation with pipelines?
When using a vector db integration, the docstore is usually disabled to simplify storage (unless you set store_nodes_override=True in the constructor).

With a pipeline, just attach a vector db and docstore to the pipeline, and run it.

Then you can create the index with index = VectorStoreIndex.from_vector_store(vector_store)

Just don't forget to save/load your docstore somewhere, or use a remote docstore (MongoDB, Redis, etc.).
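A rough sketch of those steps might look like the following. It assumes the llama-index 0.10+ package layout (`llama_index.core...`), which differs from the older imports used elsewhere in this thread, and it assumes each document carries a stable ID. `run_ingestion` and the default `persist_dir` are made-up names, and the chunking values are simply copied from the question.

```python
import os


def run_ingestion(documents, vector_store, persist_dir="./pipeline_storage"):
    """Ingest documents with docstore-backed de-duplication, then build an index."""
    # Imports live inside the function so the sketch loads without the library
    # installed. Assumption: llama-index >= 0.10 module paths.
    from llama_index.core import VectorStoreIndex
    from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.core.storage.docstore import SimpleDocumentStore
    from llama_index.embeddings.openai import OpenAIEmbedding

    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=512, chunk_overlap=256),
            OpenAIEmbedding(),
        ],
        vector_store=vector_store,  # e.g. the PGVectorStore from the question
        docstore=SimpleDocumentStore(),
        docstore_strategy=DocstoreStrategy.UPSERTS,
    )

    # Restore the docstore saved by a previous run, if there is one, so that
    # unchanged documents are recognised and skipped.
    if os.path.isdir(persist_dir):
        pipeline.load(persist_dir)

    pipeline.run(documents=documents, show_progress=True)

    # Persist after running, so the saved docstore reflects this run.
    pipeline.persist(persist_dir)

    return VectorStoreIndex.from_vector_store(vector_store)
```

Calling `run_ingestion(documents, vector_store)` twice with the same documents should then leave the table unchanged on the second call, provided the persist directory survives between runs.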
I am doing this:
docstore = SimpleDocumentStore()
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, docstore=docstore
)
index = VectorStoreIndex.from_documents(
    documents=[],
    service_context=service_context,
    storage_context=storage_context,
    show_progress=True,
    store_nodes_override=True,
)
storage_context.persist(persist_dir=datadir)
Still, the docstore.json in the persist directory is empty:
{}
Here is my full code:
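# Imports, assuming the legacy (pre-0.10) llama_index package layout that ServiceContext belongs to
from llama_index import PromptHelper, ServiceContext, StorageContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI as LlamaOpenAI
from llama_index.node_parser import SentenceSplitter
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores import PGVectorStore
from sqlalchemy import make_url

# connection_string, db_name, datadir and documents are defined elsewhere
llm = LlamaOpenAI(temperature=0, model="gpt-4")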
embed_model = OpenAIEmbedding()
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=256)
prompt_helper = PromptHelper(
    context_window=8192,
    num_output=256,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=None,
)
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    text_splitter=text_splitter,
    prompt_helper=prompt_helper,
)

url = make_url(connection_string)
port = url.port or 5432
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=port,
    user=url.username,
    table_name="llama_index",
    embed_dim=1536,  # openai embedding dimension
)


docstore = SimpleDocumentStore()
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, docstore=docstore
)
index = VectorStoreIndex.from_documents(
    documents=[],
    service_context=service_context,
    storage_context=storage_context,
    show_progress=True,
    store_nodes_override=True,
)
storage_context.persist(persist_dir=datadir)
print(index.refresh_ref_docs(documents))
Let me know if I am missing something. I didn't explicitly add the nodes to the document store, mainly because I am passing store_nodes_override=True.
I tried persisting the docstore, and made sure to change the node ID to a hash so that the same content gets the same ID. Still, I see duplicate entries being created when using index.insert_nodes(nodes).
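For reference, the hash-as-ID idea can be sketched without any library. A content hash only makes IDs repeatable; something still has to compare those IDs against what is already stored before inserting. Whether `insert_nodes` performs that comparison is not settled in this thread, so the sketch does the check itself. `content_id`, `only_new`, and `known_ids` are hypothetical names.

```python
import hashlib


def content_id(text):
    """Derive a repeatable ID from the chunk text itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def only_new(chunks, known_ids):
    """Return (id, text) pairs not seen before, recording their IDs in known_ids."""
    fresh = []
    for text in chunks:
        node_id = content_id(text)
        if node_id not in known_ids:
            known_ids.add(node_id)
            fresh.append((node_id, text))
    return fresh


known_ids = set()  # would need to be saved and reloaded between runs
first = only_new(["alpha", "beta"], known_ids)
second = only_new(["alpha", "beta", "gamma"], known_ids)
```

Here `first` holds both chunks and `second` holds only the "gamma" chunk, because the other two IDs were already recorded.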
Just use the pipeline tbh 😅 It's going to be more janky than it needs to be with the index directly.
Cool will give it a try