@kapa.ai when I load nodes into a vector store how do I avoid duplication? how do I just refresh the nodes?
Please stand by @cmosguy as I search the knowledge base. I normally take about 15 seconds or so. :wave:
----------------------
To avoid duplication when loading nodes into a vector store and to refresh existing nodes, LlamaIndex provides several mechanisms. Here's how you can handle this:

  1. Using the Ingestion Pipeline:
The IngestionPipeline in LlamaIndex offers built-in deduplication and refresh capabilities. You can set up the pipeline with a docstore and vector store, and use the DocstoreStrategy.UPSERTS strategy to handle updates and avoid duplicates.

```python
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore

pipeline = IngestionPipeline(
    transformations=[...],
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(...),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
```

This setup will automatically handle deduplication and updates when you run the pipeline.

Document Management
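
For example, a quick way to see the deduplication in action is to run the same documents through the pipeline twice (a minimal sketch; the doc_id and text are placeholders, and the pipeline's transformations and embedding model are assumed to be configured):

```python
from llama_index.core import Document

docs = [Document(text="hello world", doc_id="doc-1")]

first = pipeline.run(documents=docs)   # nodes parsed, embedded, stored
second = pipeline.run(documents=docs)  # same document hash -> skipped
print(len(first), len(second))         # expect some N, then 0
```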

  2. Using the refresh_ref_docs() method:
If you're directly managing your index, you can use the refresh_ref_docs() method to update existing nodes and add new ones:

```python
refreshed_docs = index.refresh_ref_docs(doc_chunks)
```

This method updates documents that have the same doc_id but different content, and inserts new documents that aren't already present in the index.

Document Management
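
Note that refresh_ref_docs() relies on documents having stable ids across loads. A minimal sketch of the idea (the doc ids and texts here are made up, and an embedding model is assumed to be configured):

```python
from llama_index.core import Document, VectorStoreIndex

# Stable doc_ids are what lets the index match old and new versions.
doc_chunks = [
    Document(text="original text", doc_id="chapter-1"),
    Document(text="unchanged text", doc_id="chapter-2"),
]
index = VectorStoreIndex.from_documents(doc_chunks)

# Later: rebuild the documents with the same ids. Only changed ones are
# re-inserted; the return value is one boolean per input document.
doc_chunks[0] = Document(text="revised text", doc_id="chapter-1")
refreshed = index.refresh_ref_docs(doc_chunks)
print(refreshed)  # [True, False]
```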

  3. Manual management with update_ref_doc():
For more granular control, you can use the update_ref_doc() method to update specific documents:

```python
doc_chunks[0].text = "Brand new document text"
index.update_ref_doc(doc_chunks[0])
```

This approach allows you to update individual documents as needed.
Document Management

By using these methods, you can effectively manage your vector store to avoid duplication and keep your nodes up-to-date. The choice between these approaches depends on your specific use case and the level of control you need over the process.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@Logan M the system keeps recommending deprecated ingestion pipelines
ingestion pipelines aren't deprecated
query pipelines are (similar name, very different things)
@kapa.ai if I have the following:

```python
md_node_parser = MarkdownElementNodeParser(llm=llm_4o_2)
db = chromadb.PersistentClient(path=chromadb_path)
chroma_collection = db.get_or_create_collection("oa_classes")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
```

how do I use that with the Ingestion Pipeline?
Let me check the knowledge base for you again. One moment! :books:
----------------------
To use the components you've set up with the Ingestion Pipeline in LlamaIndex, you can follow these steps:

  1. First, import the necessary classes:
```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
```

  2. Then, create your Ingestion Pipeline using the components you've already set up:
```python
pipeline = IngestionPipeline(
    transformations=[md_node_parser],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
    storage_context=storage_context,
)
```

Here's a breakdown of what's happening:

  • We're using the md_node_parser you've created as part of the transformations list. This will parse your Markdown documents into nodes.
  • The vector_store you've set up with Chroma is used directly in the pipeline.
  • We're creating a new SimpleDocumentStore() for document storage. You can replace this with a different docstore if needed.
  • The storage_context you've created is passed to the pipeline, which will use the vector store you've set up.
  3. To use the pipeline, you can now run:
```python
nodes = pipeline.run(documents=your_documents)
```

Replace your_documents with the list of documents you want to ingest.

This setup will use your Chroma vector store for storing embeddings, and the Markdown parser to process your documents. The Ingestion Pipeline will handle the process of transforming your documents into nodes and storing them in both the docstore and vector store.
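
After ingestion, a common next step is to build a query index directly on top of the populated vector store rather than re-ingesting. A sketch of that, assuming an embedding model and LLM are configured (the query string is just an example):

```python
from llama_index.core import VectorStoreIndex

# Wrap the already-populated Chroma store in an index; nothing is
# re-embedded or re-inserted here.
index = VectorStoreIndex.from_vector_store(vector_store)

query_engine = index.as_query_engine()
response = query_engine.query("What classes are documented?")
print(response)
```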

Remember, if you want to avoid duplication and refresh existing nodes, you can set the docstore_strategy parameter when creating the pipeline:

```python
from llama_index.core.ingestion import DocstoreStrategy

pipeline = IngestionPipeline(
    transformations=[md_node_parser],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
    storage_context=storage_context,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
```

This will update existing documents and add new ones, avoiding duplication in your vector store.

LlamaIndex Docs > Examples > Vector Stores > Save to Disk

LlamaIndex Docs > Community > Integrations > Using Vector Stores
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@Logan M sorry but the IngestionPipeline does not take storage_context as an argument, this thing is confused, what do you recommend?
just remove the storage context from the bots example
docstore and vector_store can be passed directly
but this assumes you are saving and loading the docstore between ingestions
wait i am using chroma db, so it is putting it in automatically, right?
not sure what you mean. It's putting stuff in Chroma, yes
And if you add a docstore to the pipeline, it adds extra info to the docstore to help with deduplicating
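e.g. you can peek at that bookkeeping (assuming a SimpleDocumentStore; get_all_document_hashes() returns the hash-to-doc-id map the pipeline checks against):

```python
# After a run, the docstore holds one hash per ingested document.
# On the next run, inputs with a matching hash are skipped, and
# changed ones are upserted.
hashes = pipeline.docstore.get_all_document_hashes()
print(hashes)  # {doc_hash: doc_id, ...}
```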
does this look good to you?

```python
db = chromadb.PersistentClient(path=chromadb_path)
chroma_collection = db.get_or_create_collection("oa_classes")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
pipeline = IngestionPipeline(
    transformations=[MarkdownElementNodeParser(llm=llm_4o_2)],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
try:
    pipeline.load(storage_path_root)
except FileNotFoundError:
    pass
for json_file in json_files[:3]:
    oa_class_docs = create_docs_from_json_file(json_file)

    nodes = await pipeline.arun(documents=oa_class_docs)
```
yea that looks fine to me, but you might want to save/load the SimpleDocumentStore between runs
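e.g. (a sketch; this pairs with the pipeline.load() call already in your snippet):

```python
# After the arun() loop, write the docstore (and ingestion cache) to
# disk so the pipeline.load() at the top finds it on the next run.
pipeline.persist(storage_path_root)
```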
thanks for the guidance!