@kapa.ai when I load nodes into a vector store how do I avoid duplication? how do I just refresh the nodes?
Please stand by @cmosguy as I search the knowledge base. I normally take about 15 seconds or so. :wave:
----------------------
To avoid duplication when loading nodes into a vector store and to refresh existing nodes, LlamaIndex provides several mechanisms. Here's how you can handle this:

  1. Using the Ingestion Pipeline:
The IngestionPipeline in LlamaIndex offers built-in deduplication and refresh capabilities. You can set up the pipeline with a docstore and vector store, and use the DocstoreStrategy.UPSERTS strategy to handle updates and avoid duplicates.

```python
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore

pipeline = IngestionPipeline(
    transformations=[...],
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(...),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
```

This setup will automatically handle deduplication and updates when you run the pipeline.

Document Management
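
For example, a quick way to see the deduplication in action is to run the same documents through the pipeline twice (a minimal sketch; the doc_id and text are placeholders, and the pipeline's transformations and embedding model are assumed to be configured):

```python
from llama_index.core import Document

docs = [Document(text="hello world", doc_id="doc-1")]

first = pipeline.run(documents=docs)   # nodes parsed, embedded, stored
second = pipeline.run(documents=docs)  # same document hash -> skipped
print(len(first), len(second))         # expect some N, then 0
```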

  2. Using the refresh_ref_docs() method:
If you're directly managing your index, you can use the refresh_ref_docs() method to update existing nodes and add new ones:

```python
refreshed_docs = index.refresh_ref_docs(doc_chunks)
```

This method updates documents that have the same doc_id but different content, and inserts new documents that aren't already present in the index.

Document Management
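
Note that refresh_ref_docs() relies on documents having stable ids across loads. A minimal sketch of the idea (the doc ids and texts here are made up, and an embedding model is assumed to be configured):

```python
from llama_index.core import Document, VectorStoreIndex

# Stable doc_ids are what lets the index match old and new versions.
doc_chunks = [
    Document(text="original text", doc_id="chapter-1"),
    Document(text="unchanged text", doc_id="chapter-2"),
]
index = VectorStoreIndex.from_documents(doc_chunks)

# Later: rebuild the documents with the same ids. Only changed ones are
# re-inserted; the return value is one boolean per input document.
doc_chunks[0] = Document(text="revised text", doc_id="chapter-1")
refreshed = index.refresh_ref_docs(doc_chunks)
print(refreshed)  # [True, False]
```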

  3. Manual management with update_ref_doc():
For more granular control, you can use the update_ref_doc() method to update specific documents:

```python
doc_chunks[0].text = "Brand new document text"
index.update_ref_doc(doc_chunks[0])
```

This approach allows you to update individual documents as needed.
Document Management

By using these methods, you can effectively manage your vector store to avoid duplication and keep your nodes up-to-date. The choice between these approaches depends on your specific use case and the level of control you need over the process.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@Logan M the system keeps recommending deprecated ingestion pipelines
ingestion pipelines aren't deprecated
query pipelines are (similar name, very different things)
@kapa.ai if I have the following:

```python
md_node_parser = MarkdownElementNodeParser(llm=llm_4o_2)
db = chromadb.PersistentClient(path=chromadb_path)
chroma_collection = db.get_or_create_collection("oa_classes")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
```

how do I use that with the Ingestion Pipeline?
Let me check the knowledge base for you again. One moment! :books:
----------------------
To use the components you've set up with the Ingestion Pipeline in LlamaIndex, you can follow these steps:

  1. First, import the necessary classes:
```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
```

  2. Then, create your Ingestion Pipeline using the components you've already set up:
```python
pipeline = IngestionPipeline(
    transformations=[md_node_parser],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
    storage_context=storage_context,
)
```

Here's a breakdown of what's happening:

  • We're using the md_node_parser you've created as part of the transformations list. This will parse your Markdown documents into nodes.
  • The vector_store you've set up with Chroma is used directly in the pipeline.
  • We're creating a new SimpleDocumentStore() for document storage. You can replace this with a different docstore if needed.
  • The storage_context you've created is passed to the pipeline, which will use the vector store you've set up.
  3. To use the pipeline, you can now run:
```python
nodes = pipeline.run(documents=your_documents)
```

Replace your_documents with the list of documents you want to ingest.

This setup will use your Chroma vector store for storing embeddings, and the Markdown parser to process your documents. The Ingestion Pipeline will handle the process of transforming your documents into nodes and storing them in both the docstore and vector store.
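
After ingestion, a common next step is to build a query index directly on top of the populated vector store rather than re-ingesting. A sketch of that, assuming an embedding model and LLM are configured (the query string is just an example):

```python
from llama_index.core import VectorStoreIndex

# Wrap the already-populated Chroma store in an index; nothing is
# re-embedded or re-inserted here.
index = VectorStoreIndex.from_vector_store(vector_store)

query_engine = index.as_query_engine()
response = query_engine.query("What classes are documented?")
print(response)
```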

Remember, if you want to avoid duplication and refresh existing nodes, you can set the docstore_strategy parameter when creating the pipeline:

```python
from llama_index.core.ingestion import DocstoreStrategy

pipeline = IngestionPipeline(
    transformations=[md_node_parser],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
    storage_context=storage_context,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
```

This will update existing documents and add new ones, avoiding duplication in your vector store.

LlamaIndex Docs > Examples > Vector Stores > Save to Disk

LlamaIndex Docs > Community > Integrations > Using Vector Stores
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@Logan M sorry but the IngestionPipeline does not take storage_context as an argument, this thing is confused, what do you recommend?
just remove the storage context from the bots example
docstore and vector_store can be passed directly
but this assumes you are saving and loading the docstore between ingestions
wait i am using chroma db, so it is putting it in automatically, right?
not sure what you mean. It's putting stuff in Chroma, yes
And if you add a docstore to the pipeline, it adds extra info to the docstore to help with deduplicating
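e.g. you can peek at that bookkeeping (assuming a SimpleDocumentStore; get_all_document_hashes() returns the hash-to-doc-id map the pipeline checks against):

```python
# After a run, the docstore holds one hash per ingested document.
# On the next run, inputs with a matching hash are skipped, and
# changed ones are upserted.
hashes = pipeline.docstore.get_all_document_hashes()
print(hashes)  # {doc_hash: doc_id, ...}
```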
does this look good to you?

```python
db = chromadb.PersistentClient(path=chromadb_path)
chroma_collection = db.get_or_create_collection("oa_classes")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
pipeline = IngestionPipeline(
    transformations=[MarkdownElementNodeParser(llm=llm_4o_2)],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
try:
    pipeline.load(storage_path_root)
except FileNotFoundError:
    pass
for json_file in json_files[:3]:
    oa_class_docs = create_docs_from_json_file(json_file)

    nodes = await pipeline.arun(documents=oa_class_docs)
```
yea that looks fine to me, but you might want to save/load the SimpleDocumentStore between runs
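e.g. (a sketch; this pairs with the pipeline.load() call already in your snippet):

```python
# After the arun() loop, write the docstore (and ingestion cache) to
# disk so the pipeline.load() at the top finds it on the next run.
pipeline.persist(storage_path_root)
```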
thanks for the guidance!