Docstore

At a glance

The community member is working with Node PostProcessors and is trying to add a PrevNext stage to their reranking process. They are using a Weaviate vector store and a Redis docstore. The PrevNext postprocessor is failing with a "doc_id not found" error, which the community member believes is related to the docstore assignment.

The community members discuss several potential solutions, including:

  • Integrating the Redis docstore into the index, as it is currently only being used for ingestion
  • Abandoning the PrevNext postprocessor if using Weaviate vector store and Redis docstore
  • Explicitly adding the output nodes to the docstore after the ingestion pipeline runs
  • Potentially needing an IndexStore in addition to the DocStore
  • Avoiding duplicating documents in the docstore by only adding them outside of the ingestion pipeline
  • Using a file watcher to process one file at a time and pass the file path to the ingestion script to prevent duplicates

There is no explicitly marked answer, but the community members seem to have worked through the issue and found a solution involving adding the output nodes to the docstore after the ingestion pipeline runs.

I'm messing around with Node PostProcessors. Reranking is working fine and I thought I would add in a second stage, trying PrevNext.

Plain Text
    index = VectorStoreIndex.from_vector_store(
        vector_store, service_context=service_context
    )


    prevnext = PrevNextNodePostprocessor(
        docstore=index.docstore,
        num_nodes=1,
        mode="previous",
    )


    rerank = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3)



    chat_engine = index.as_chat_engine(
        chat_mode="context",
        memory=memory,
        verbose=True,
        streaming=STREAMING,
        service_context=service_context,
        similarity_top_k=10,
        node_postprocessors=[rerank, prevnext],
        ....



When it gets to the PrevNext postprocessor, it fails with "doc_id 0870c4a7-f226-4988-bc82-1702652d8f7e not found."

I have a feeling it has something to do with the docstore= assignment in prevnext, but I'm not sure where I'm going wrong.

VectorStore = Weaviate
15 comments
The docstore is disabled with vector db integrations, to simplify storage.

So with Weaviate, it will be empty, and the postprocessor fails.
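
As a quick illustration (a minimal sketch, not from the thread, assuming the same vector_store and service_context as in the snippet above), the docstore attached to an index built with from_vector_store holds no nodes, which is exactly why the doc_id lookup fails:

Plain Text
    index = VectorStoreIndex.from_vector_store(
        vector_store, service_context=service_context
    )
    # The index gets a fresh in-memory docstore; nothing stored in Weaviate is
    # mirrored into it, so PrevNextNodePostprocessor cannot resolve any doc_id.
    print(len(index.docstore.docs))  # -> 0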
Thanks Logan - I have the docstore in Redis. Should I be integrating that into the index somehow? I'm only utilizing it now for ingestion.
Or should I abandon the PrevNext post-processing completely if using the Weaviate vector store / Redis docstore?
Wait - I think I see what I need to do. Will report back. You sent me on the right path. Thank you.
Yea, what I was going to suggest was using docstore.add_documents(nodes) with the nodes that the ingestion pipeline outputs

Then, use that docstore in the prev/next thing
Well, the docstore is getting populated correctly as part of the ingestion pipeline. So I am left with a vector db and a document store.

When I go to spin up the chat engine, it spins up from what was stored previously.

Plain Text
    vector_store = WeaviateVectorStore(
        weaviate_client=client, index_name=collection_name, text_key="content"
    )

    docstore = RedisDocumentStore.from_host_and_port(
        "localhost", 6379, namespace="MyDocuments"
    )

    storage_context = StorageContext.from_defaults(docstore=docstore)

    rerank = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
    )

    prevnext = PrevNextNodePostprocessor(
        docstore=storage_context.docstore,
        num_nodes=1,
        mode="both",
    )


I thought this would work, but it didn't quite work out.
The docstore is only adding the initial input documents. But for the prev/next to work, we need to add the output nodes too.

Plain Text
docstore = RedisDocumentStore.from_host_and_port(
  "localhost", 6379, namespace="MyDocuments"
)
pipeline = IngestionPipeline(..., docstore=docstore)

nodes = pipeline.run(documents=documents)

# remove embeddings to save space, add to docstore
for node in nodes:
  node.embedding = None
docstore.add_documents(nodes)

prevnext = PrevNextNodePostprocessor(docstore=docstore, ...)
Then it will work, I think.
It seems that the missing piece is that I need the IndexStore along with the DocStore.

Unfortunately my IngestionPipeline isn't accepting index_store, although it handles docstore, which is what I've been doing.

Referring to
https://docs.llamaindex.ai/en/stable/examples/docstore/RedisDocstoreIndexStoreDemo.html#add-to-docstore
In your example, would the documents not be added multiple times to the docstore?

Once as part of the pipeline with
IngestionPipeline(..., docstore=docstore)

and another time with
docstore.add_documents(nodes)
So I got all of it working. Still not sure whether it is possible to add index_store to the ingestion pipeline just like the vector/doc store.

I feel like I'm duplicating my docstore documents by doing it outside of the ingestion.
You are kind of duplicating, but tbh it's not a huge deal. Ingestion pipeline does not accept an index store, because it's not an index
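
For context, the Redis docstore/index store demo linked above populates the index store by constructing the index itself against a StorageContext that holds both stores. A rough sketch of that pattern (not from this thread; it assumes the nodes returned by the pipeline run and the vector_store defined earlier, and import paths vary by llama_index version):

Plain Text
# Sketch only -- adjust imports to your llama_index version
from llama_index.storage.docstore import RedisDocumentStore
from llama_index.storage.index_store import RedisIndexStore

storage_context = StorageContext.from_defaults(
    docstore=RedisDocumentStore.from_host_and_port(
        "localhost", 6379, namespace="MyDocuments"
    ),
    index_store=RedisIndexStore.from_host_and_port(
        "localhost", 6379, namespace="MyDocuments"
    ),
    vector_store=vector_store,
)

# Building the index (rather than only running the IngestionPipeline) is what
# writes entries into the index store.
index = VectorStoreIndex(nodes, storage_context=storage_context)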
Here's a demo with an alternative. But it's kind of annoying af 😅 https://github.com/run-llama/llama_index/issues/8832#issuecomment-1805969818
Well, this does seem to be my exact issue. This is helpful - I'm switching my ingestion to only work off of a file watcher and process one file through the pipeline at a time to prevent duplicates.

Using inotifywait, then passing the filename into the ingestion Python script with the file path as a param.
Plain Text
inotifywait -m -r -e create --format '%w%f' "$DIRECTORY_TO_MONITOR" | while read FILE
do
    # Check if the path is a regular file (not a directory)
    if [ -f "$FILE" ]; then
        echo "New file added: $FILE"
        # Pass the file path to the Python script
        python "$PYTHON_SCRIPT" "$FILE"
    else
        echo "Ignored directory change: $FILE"
    fi
done
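
The per-file ingestion script itself isn't shown in the thread; a minimal sketch of what it could look like (a hypothetical ingest.py, reusing the pipeline and docstore objects set up earlier):

Plain Text
import sys
from llama_index import SimpleDirectoryReader

# ingest.py -- invoked by the inotifywait loop with a single file path argument
file_path = sys.argv[1]
documents = SimpleDirectoryReader(input_files=[file_path]).load_data()

# Run just this file through the existing IngestionPipeline (docstore + vector store)
nodes = pipeline.run(documents=documents)

# Mirror the output nodes into the docstore so PrevNextNodePostprocessor can find them
for node in nodes:
    node.embedding = None
docstore.add_documents(nodes)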