
Updated 4 months ago

@Logan M, @WhiteFang_Jr

At a glance

The community members are discussing the behavior of the llama index ingestion pipeline when updating documents in the docstore and vector store. The main points are:

- The docstore supports UPSERT operations to handle existing and new documents, but it's unclear how the vector store handles updates.

- When a document is updated, the number of nodes in the vector store can increase, resulting in a combination of old and new nodes. The community members are trying to understand if there's an option to perform UPSERT operations on the vector store as well, to delete all existing nodes and insert the newly created ones.

The community members share code examples and configurations, and discuss potential issues with the document ID, vector store integration, and library versions. Eventually, one community member suggests that upgrading the package versions resolved the issue.

When using the LlamaIndex ingestion pipeline for document management, the docstore offers an UPSERT strategy (among others) for handling existing and new documents.
So if an existing document is found, its doc_hash value is checked and the UPSERT operation is performed if required. This way we always have accurate documents in the docstore.

But what happens within the vector store?
Are the existing nodes replaced with the new ones for the updated document?

In my case, the number of nodes created for a document increased after that document was updated.
The total is now a combination of old nodes and new nodes.

Is there an option to perform the UPSERT operation for vector nodes as well,
i.e. deleting all the existing nodes and inserting the newly created ones?

here is my configuration for the ingestion pipeline for reference -

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        parser,
        title_metadata_extractor,
        summary_extractor,
        qa_extractor,
        embed_model,
    ],
    vector_store=vector_store,
    docstore=docstore,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
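The docstore-side check that the UPSERTS strategy performs can be modeled roughly as follows. This is a minimal sketch of the idea only, not the actual llama-index internals; `needs_reingest` and the hash dictionary are hypothetical names.

```python
import hashlib

# Rough model of the UPSERTS docstore check: hash the document text and
# compare it against the stored hash; only a new or changed hash triggers
# re-processing. Illustrative only -- llama-index computes its own doc_hash
# internally.
def needs_reingest(stored_hashes, doc_id, text):
    new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if stored_hashes.get(doc_id) == new_hash:
        return False  # unchanged document: skip the transformations
    stored_hashes[doc_id] = new_hash
    return True  # new or modified document: re-run the pipeline for it

hashes = {}
first = needs_reingest(hashes, "test", "This is a test document.")
second = needs_reingest(hashes, "test", "This is a test document.")
third = needs_reingest(hashes, "test", "This is a MODIFIED test document.")
```

The question above is whether the vector store applies an equivalent replace-on-change step for the generated nodes.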
28 comments
I don't think any third-party vector store supports comparing the nodes before inserting yet!
meaning whenever a document gets updated, the docstore will do the comparison checks but the vector store can't?
But this way the vector store is going to contain lots of historical nodes for such documents.

Isn't it going to affect the query engine responses for such a vector store?
"but this way the vector store is going to contain lots of historical nodes for such documents" -- not really; if the content is changed, the old nodes are deleted from the vector store and the new content is inserted
Didn't know this! Thanks for correcting ❀️
@Logan M , but for the above configured ingestion pipeline, one of the documents was updated in the docstore and new nodes were generated for that document.
When I checked the total node count after this operation, it had increased significantly (nearly doubled). When I inspected the nodes manually in the vector store, the nodes for that document were a combination of old nodes and new nodes. (I identified them using a timestamp field in the metadata that records when the document was last updated by the ingestion pipeline.)

Now this thing is kind of creating a lot of confusion.
Assuming the document ID was the same (and the delete method on the vector store was the same), all nodes associated with the document would have been deleted before inserting

So it sounds like either
a) the document ID was not the same
b) the vector db integration you are using did not implement delete correctly
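For reference, the intended delete-then-insert semantics can be modeled in a few lines. This is a toy in-memory model of what a correctly implemented vector store integration is expected to do; the field names are illustrative, not a real store's schema.

```python
# Toy model of vector-store upsert semantics: before inserting the freshly
# generated nodes for a document, every stored node carrying the same
# ref_doc_id is deleted. If delete works correctly, the store never ends up
# with a mix of old and new chunks for one document.
def upsert_nodes(store, doc_id, chunks):
    kept = [n for n in store if n["ref_doc_id"] != doc_id]
    kept.extend({"ref_doc_id": doc_id, "text": c} for c in chunks)
    return kept

store = []
store = upsert_nodes(store, "test", ["chunk a", "chunk b", "chunk c"])
store = upsert_nodes(store, "test", ["chunk a2", "chunk b2"])
# Only the two new chunks should remain for "test".
```

If old and new nodes coexist after a re-run, one of the two failure modes above (inconsistent ID, or a broken delete in the integration) is the likely cause.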
@Logan M , here is how I have configured my MongoDB Vector Store -

Plain Text
vector_store = MongoDBAtlasVectorSearch(
    pymongo.MongoClient(mongo_uri),
    db_name=ingestion_pipeline_cfg.db_name,
    collection_name=ingestion_pipeline_cfg.collection_name,
    index_name=ingestion_pipeline_cfg.index_name,
)


Regarding the document ID, I have a staging collection where I perform an incremental load whenever I need to update the documents.
Basically, if a new document is inserted, its "_id" field is retrieved and used as the LlamaIndex doc_id for the ingestion pipeline, and if an existing document is updated in the staging collection, we likewise retrieve its "_id" to use as the LlamaIndex doc_id.
I have also verified manually that the doc_id/ref_doc_id/document_id on all the nodes of the updated document matches the "_id" value stored in the docstore and the staging collection for this document.
The docstore is working fine with this approach in UPSERT mode, so I don't think the document ID is the issue here.
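A quick sanity check along these lines could look like the sketch below. The metadata layout and `ids_consistent` helper are assumptions for illustration, not the thread author's actual code.

```python
# Given node metadata dicts fetched from the vector store, confirm that every
# node's ref_doc_id matches the staging-collection "_id" for the document.
def ids_consistent(nodes, staging_id):
    return all(n["metadata"]["ref_doc_id"] == staging_id for n in nodes)

# Illustrative fake ids only.
nodes = [
    {"metadata": {"ref_doc_id": "staging-doc-1"}},
    {"metadata": {"ref_doc_id": "staging-doc-1"}},
]
ok = ids_consistent(nodes, "staging-doc-1")
```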

If you have suggestions on the Vector store configurations, then please let me know.
@Logan M , @WhiteFang_Jr for visibility..
@Logan M , Hoping to get your response soon on this. Appreciate all the guidance/help provided so far.
This example works fine for me

Plain Text
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from pymongo import MongoClient

docstore = MongoDocumentStore.from_uri(
    "mongodb+srv://dbUser:<password>@llama-index.wozidlv.mongodb.net/?retryWrites=true&w=majority&appName=llama-index", 
    db_name="llama_index",
    namespace="test_upsert",
)

client = MongoClient(
    "mongodb+srv://dbUser:<password>@llama-index.wozidlv.mongodb.net/?retryWrites=true&w=majority&appName=llama-index",
)

vector_store = MongoDBAtlasVectorSearch(mongodb_client=client, db_name="llama_index", collection_name="test_upsert_vector")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    docstore=docstore,
    vector_store=vector_store,
)

document = Document(id_="test", text="This is a test document.")

pipeline.run(documents=[document])


I run this once, and in my db I see a single node.

I change the text slightly, run again, and again I still see one node, but now with the new text
Seems to be working fine?
@Logan M , Really appreciate your response on this one.
Though it seems that if the text size for the document is small, only one node is created (based on chunk size). In that scenario, even if I modify the text a bit, it would still create a single node and replace the existing one.

But in a scenario where the text is large enough that multiple nodes are created for a single document in the vector store, and I then modify the original text, the ingestion pipeline creates additional nodes (so the store now holds a combination of old nodes and new nodes).
Interestingly, I have also noticed something odd, explained in the example below:

for sentence splitter with chunk_size = 128 and chunk_overlap = 10 along with embed_model ="text-embedding-3-small",

after running the ingestion pipeline for the first time, I get 3 nodes for the text.

But if I modify the text (prepend a "MODIFIED" string to the original text), the total becomes 5 nodes in the vector store: 2 from the old document and 3 from the new one. Surprisingly, rather than 6 nodes in total we get 5. On checking, I found that the one old node whose text was modified was replaced by the new node, which is why we are left with 2 old nodes and 3 new ones.
with

Plain Text
text = """*** This is a test document. Adding more data now. This is not
modified yet but will be modified in the next iteration.[][][][][][]%%%%%%%%%%
Being done in order to test the UPSERT method for Vector Store..............**"""

for the first run:
document = Document(id_="test", text=text, extra_info={"title": "test 1"})

for the second run:
document = Document(id_="test", text=text, extra_info={"title": "test 2"})

Could you please now test with this information at your end?
It's still working fine for me πŸ˜…

I took the same script above, and modified so that more than one node is created

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=128, chunk_overlap=0),
        OpenAIEmbedding(),
    ],
    docstore=docstore,
    vector_store=vector_store,
)

document = Document(id_="test", text="This is a test document."*100)


1 document in the docstore, 6 nodes in the vector store

Then I run again with

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    docstore=docstore,
    vector_store=vector_store,
)

document = Document(id_="test", text="This is a MODIFIED test document."*100)


1 document in the docstore, 1 node in the vector store (since the chunk size went back to the default, 1024)
I really suggest running the above yourself. If I had to guess, there is some difference in your code compared to how I am running this (the document ID is not consistent, or something else πŸ€·β€β™‚οΈ)
@Logan M , this is quite weird to be honest.
So, I ran your code as is and here are the findings:
  1. For the first run, 5 nodes were created for the document in the vector store.
  2. For the second run with the modified text and the new ingestion pipeline definition [transformations changed], the docstore worked fine, but the vector store still has 5 nodes, of which 1 node has the modified text with the complete chunk while the other 4 nodes are from the first iteration.
Not sure how things are working differently here.
Do you have the latest version of things installed?
Let me share the versions of libraries with you..
Name: openai
Version: 1.29.0
Summary: The official Python library for the openai API
Home-page:
Author:
Author-email: OpenAI <support@openai.com>


Name: pymongo
Version: 4.7.2
Summary: Python driver for MongoDB
Home-page:
Author: The MongoDB Python Team
Author-email:
License: Apache License


Name: llama-index
Version: 0.10.38
Summary: Interface between LLMs and your data
Home-page: https://llamaindex.ai
Author: Jerry Liu
Author-email: jerry@llamaindex.ai
License: MIT
what about pip show llama-index-storage-kvstore-mongodb and pip show llama-index-vector-stores-mongodb ?
Hi @Logan M , apologies for the late reply.

Name: llama-index-storage-kvstore-mongodb
Version: 0.1.2
Summary: llama-index kvstore mongodb integration
Home-page:
Author: Your Name
Author-email: you@example.com
License: MIT

Name: llama-index-vector-stores-mongodb
Version: 0.1.4
Summary: llama-index vector_stores mongodb integration
Home-page:
Author: Your Name
Author-email: you@example.com
License: MIT
@Logan M just for visibility in case you missed it.
It works for me, doesn't work for you; not sure what else I can do at this point. πŸ˜…

The only thing I noticed is that I had v0.1.5 of the vector store.

Use a different storage backend? Use a debugger and step through your code?
Agreed.
As a workaround, what I have done for now is delete the older nodes from the vector store for a given document, based on a modified-datetime field:
basically finding the max of this field and deleting the nodes whose value is less than it.

For the vector store, I will keep testing and see if things work out.
Thanks for your help.
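The timestamp-based cleanup described above could be sketched like this. It is a pure-Python illustration; the `modified` and `ref_doc_id` field names are assumptions about the node metadata, not the thread author's actual schema.

```python
# Keep only the nodes carrying the most recent "modified" timestamp for a
# given document; every older node for that document is treated as stale and
# dropped. Nodes belonging to other documents are untouched.
def drop_stale_nodes(nodes, doc_id):
    doc_nodes = [n for n in nodes if n["ref_doc_id"] == doc_id]
    if not doc_nodes:
        return nodes
    latest = max(n["modified"] for n in doc_nodes)
    return [
        n for n in nodes
        if n["ref_doc_id"] != doc_id or n["modified"] == latest
    ]

# ISO-8601 strings compare correctly in lexicographic order.
nodes = [
    {"ref_doc_id": "test", "modified": "2024-05-01T00:00:00", "text": "old"},
    {"ref_doc_id": "test", "modified": "2024-05-02T00:00:00", "text": "new"},
    {"ref_doc_id": "other", "modified": "2024-01-01T00:00:00", "text": "keep"},
]
cleaned = drop_stale_nodes(nodes, "test")
```

The same max-then-delete idea would translate to a Mongo aggregation plus a delete filter in a real deployment.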
An update on the version thing: I guess it was the main issue.
After upgrading the package to that version, things seem to be working fine.
Thanks for sharing the information and all the help you have provided so far.