I’m trying to update the nodes of a document that is st...

At a glance

The community member @cmosguy is trying to update the nodes of a document stored in the Chroma DB using an ingestion pipeline. The bot @kapa.ai provides an example of how to update the metadata of a document in the Chroma DB, but @cmosguy indicates that this does not seem right and asks if there is a docstore strategy that can update the node embeddings or nodes if the transformations of the pipeline are changed.

@kapa.ai then explains that the LlamaIndex IngestionPipeline provides a mechanism to handle updates to documents and their nodes through the docstore_strategy parameter. If the docstore_strategy is set to DocstoreStrategy.UPSERTS, the pipeline will handle upserting data into the vector store, re-processing and upserting the document if a duplicate doc_id is detected and the hash of the document has changed.

@cmosguy then asks how to update the nodes even if the document has not changed. @kapa.ai explains that the update_ref_doc method provided by LlamaIndex can be used to update a document and its corresponding

cmosguy

@kapa.ai I’m trying to update the nodes of a document that is stored in the chroma db. How do I do that with an ingestion pipeline?

15 comments

kapa.ai

Please stand by @cmosguy as I search the knowledge base. I normally take about 20 seconds or so. :wave:

kapa.ai

----------------------
To update the nodes of a document stored in the Chroma DB using an ingestion pipeline, you would first need to retrieve the document you want to update. You can do this by using the get method on the chroma_collection object.

Here is an example of how you can update a document:

doc_to_update = chroma_collection.get(limit=1)
doc_to_update["metadatas"][0] = {
    **doc_to_update["metadatas"][0],
    **{"author": "Paul Graham"},
}
chroma_collection.update(
    ids=[doc_to_update["ids"][0]], metadatas=[doc_to_update["metadatas"][0]]
)
updated_doc = chroma_collection.get(limit=1)
print(updated_doc["metadatas"][0])

In this example, the get method retrieves the first document from the Chroma collection, the new author value is merged into its existing metadata, and the update method writes the merged metadata back under the same ID. The final get call retrieves the document again and prints its metadata to confirm the update.

Please note that this example only changes metadata; you would need to replace "author": "Paul Graham" with the actual changes you want to make.

You can find more details about this in the ChromaIndexDemo notebook.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------

cmosguy

@kapa.ai this does not seem right. Isn’t there a docstore strategy that will update the node embeddings or nodes if I change the transformations of the pipeline?

kapa.ai

Let me check the knowledge base for you. One moment! :books:

kapa.ai

----------------------
Yes, you're correct. The LlamaIndex IngestionPipeline provides a mechanism to handle updates to documents and their nodes. This is managed through the docstore_strategy parameter of the IngestionPipeline.

If you have a vector store attached to your pipeline, and you set the docstore_strategy to DocstoreStrategy.UPSERTS, the pipeline will handle upserting data into your vector store. This means that if a duplicate doc_id is detected, and the hash of the document has changed, the document will be re-processed and upserted. If the hash is unchanged, the node is skipped.

Here's an example of how you can set up an ingestion pipeline with a docstore and a vector store:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.ingestion import (
    DocstoreStrategy,
    IngestionPipeline,
    IngestionCache,
)
from llama_index.core.ingestion.cache import RedisCache
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.redis import RedisVectorStore

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        "localhost", 6379, namespace="document_store"
    ),
    vector_store=RedisVectorStore(
        index_name="redis_vector_store",
        index_prefix="vectore_store",
        redis_url="redis://localhost:6379",
    ),
    cache=IngestionCache(
        cache=RedisCache.from_host_and_port("localhost", 6379),
        collection="redis_cache",
    ),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

kapa.ai

In this example, the IngestionPipeline is set up with a RedisDocumentStore as the docstore and a RedisVectorStore as the vector store. The docstore_strategy is set to DocstoreStrategy.UPSERTS, which means that the pipeline will handle upserting data into the vector store.

You can find more details about this in the LlamaIndex documentation.
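
The upsert rule described above (new doc_id is inserted, changed hash is re-processed, unchanged hash is skipped) can be sketched in plain Python. This is only an illustration of the decision logic, not the LlamaIndex implementation: `content_hash`, `upsert_decision`, and `stored_hashes` are hypothetical names, and the real pipeline may include more than the text in its hash.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stand-in for a document hash (assumption: hashes only the text)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert_decision(stored_hashes: dict[str, str], doc_id: str, text: str) -> str:
    """Return 'insert', 'update', or 'skip', recording the hash when it changes."""
    new_hash = content_hash(text)
    old_hash = stored_hashes.get(doc_id)
    if old_hash is None:
        # Unknown doc_id: treat as a brand new document.
        stored_hashes[doc_id] = new_hash
        return "insert"
    if old_hash != new_hash:
        # Known doc_id with different content: re-process and upsert.
        stored_hashes[doc_id] = new_hash
        return "update"
    # Known doc_id with identical content: nothing to do.
    return "skip"


stored: dict[str, str] = {}
print(upsert_decision(stored, "doc-1", "hello"))     # insert
print(upsert_decision(stored, "doc-1", "hello"))     # skip
print(upsert_decision(stored, "doc-1", "hello v2"))  # update
```

Because the decision in this sketch depends only on the document's ID and content, changing the pipeline's transformations alone would still yield "skip" for an unchanged document.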
----------------------

cmosguy

@kapa.ai what if I want to update the nodes even if I did not change the document?

kapa.ai

Let me check the knowledge base for you. One moment! :books:

kapa.ai

----------------------
If you want to update the nodes even if the document has not changed, you can use the update_ref_doc method provided by LlamaIndex. This method updates a document and its corresponding nodes. It's equivalent to deleting the document and then inserting it again.

Here's an example of how you can use it:

doc_chunks[0].text = "Brand new document text"
index.update_ref_doc(
    doc_chunks[0],
    update_kwargs={"delete_kwargs": {"delete_from_docstore": True}},
)

In this example, the text of the first chunk of the document is updated. Then, the update_ref_doc method is called to update the document in the index. The delete_kwargs parameter is used to ensure that the document is deleted from the docstore before it's updated.

Please note that the update_ref_doc method is part of the BaseIndex class in LlamaIndex. Depending on the specific index you're using, the implementation of this method may vary. For example, the update_ref_doc method is not implemented for the LlamaCloudIndex and will raise a NotImplementedError if called.
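
To make the "delete, then insert again" behaviour concrete, here is a toy in-memory sketch. `ToyIndex` is a hypothetical stand-in written for this illustration; it is not a LlamaIndex class, and its fixed-size chunking is only a placeholder for real transformations.

```python
class ToyIndex:
    """In-memory stand-in for an index; NOT the LlamaIndex implementation."""

    def __init__(self, chunk_size: int = 20) -> None:
        self.chunk_size = chunk_size
        self.nodes: dict[str, list[str]] = {}  # doc_id -> node texts

    def insert(self, doc_id: str, text: str) -> None:
        # Split the text into fixed-size chunks ("nodes").
        self.nodes[doc_id] = [
            text[i : i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size)
        ]

    def delete_ref_doc(self, doc_id: str) -> None:
        # Drop every node derived from this document.
        self.nodes.pop(doc_id, None)

    def update_ref_doc(self, doc_id: str, text: str) -> None:
        # Update = delete the old nodes, then insert the document again.
        self.delete_ref_doc(doc_id)
        self.insert(doc_id, text)


index = ToyIndex(chunk_size=10)
index.insert("doc_id_0", "original document text")
index.update_ref_doc("doc_id_0", "Brand new document text")
print(index.nodes["doc_id_0"])
```

In this toy version, calling `update_ref_doc` with the same text after changing `chunk_size` also rebuilds the nodes, which is the effect the question is after.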

You can find more details about this in the LlamaIndex documentation.
----------------------

cmosguy

@kapa.ai how do I do this in a pipeline?

kapa.ai

Let me check the knowledge base for you. One moment! :books:

kapa.ai

----------------------
I'm sorry for the confusion, but based on the provided knowledge sources, there isn't a direct way to update the nodes of a document in a pipeline without changing the document itself. The IngestionPipeline in LlamaIndex is designed to process and ingest new or updated documents into a document store and vector store.

If you want to update the nodes of a document without changing the document itself, you might need to manually remove the document from the document store and vector store, and then re-ingest it using the pipeline. This would effectively "update" the nodes by re-processing the document.
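
The manual "remove, then re-ingest" workaround might look roughly like the sketch below. `force_reingest` is a hypothetical helper, not part of LlamaIndex. It assumes the pipeline exposes `docstore`, `vector_store`, and `run(documents=...)`, that the docstore offers `delete_ref_doc` and `delete_document` with a `raise_error` flag, that the vector store offers `delete(ref_doc_id)`, and that each document exposes `doc_id`. Whether these calls also clear the stored hash depends on the docstore implementation, so verify against your installed version.

```python
from typing import Any, Sequence


def force_reingest(pipeline: Any, documents: Sequence[Any]) -> Any:
    """Remove each document's stored state, then run the pipeline again.

    Assumption: `pipeline` behaves like a LlamaIndex IngestionPipeline and
    each document exposes its ID as `doc_id`.
    """
    for doc in documents:
        doc_id = doc.doc_id
        if pipeline.docstore is not None:
            # Drop nodes tracked under this document plus the document entry,
            # so the duplicate check no longer sees an unchanged hash.
            pipeline.docstore.delete_ref_doc(doc_id, raise_error=False)
            pipeline.docstore.delete_document(doc_id, raise_error=False)
        if pipeline.vector_store is not None:
            # Vector stores delete by the source document's ID.
            pipeline.vector_store.delete(doc_id)
    # With the old state gone, the documents are processed as if new.
    return pipeline.run(documents=list(documents))
```

A call such as `force_reingest(pipeline, documents)` after editing the pipeline's transformations would then re-create the nodes for those documents, at the cost of recomputing their embeddings.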

However, this is not a standard use case for the IngestionPipeline, and it might not be the most efficient way to achieve your goal. If you frequently need to update the nodes of documents without changing the documents themselves, you might want to consider a different approach or tool that is better suited to your needs.

For more information on how to use the IngestionPipeline, you can refer to the LlamaIndex documentation.
----------------------

cmosguy

@kapa.ai how do I remove a document and nodes from the docstore and vector database?

kapa.ai

Let me check the knowledge base for you. One moment! :books:

kapa.ai

----------------------
To remove a document and its nodes from the docstore and vector database in LlamaIndex, you can use the delete_ref_doc method provided by the index. This method deletes a document and its corresponding nodes from the index.

Here's an example of how you can use it:

index.delete_ref_doc("doc_id_0", delete_from_docstore=True)

In this example, the delete_ref_doc method is called with the ID of the document you want to delete. The delete_from_docstore parameter is set to True, which means that the document will also be deleted from the docstore.

Please note that the delete_ref_doc method is part of the BaseIndex class in LlamaIndex. Depending on the specific index you're using, the implementation of this method may vary.

For more information on how to delete documents and nodes, you can refer to the following sources:

LlamaIndex Documentation: delete_ref_doc
LlamaIndex Documentation: Document Management
LlamaIndex Examples: Deleting documents
LlamaIndex Examples: Removing documents from the index

----------------------
