Docstore

I'm messing around with node postprocessors. Reranking is working fine, and I thought I would add in a second stage trying PrevNext.

Plain Text
index = VectorStoreIndex.from_vector_store(
    vector_store, service_context=service_context
)

prevnext = PrevNextNodePostprocessor(
    docstore=index.docstore,
    num_nodes=1,
    mode="previous",
)

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    verbose=True,
    streaming=STREAMING,
    service_context=service_context,
    similarity_top_k=10,
    node_postprocessors=[rerank, prevnext],
    ...
)



When it gets to the PrevNext post processor it fails with "doc_id 0870c4a7-f226-4988-bc82-1702652d8f7e not found."

I have a feeling it has something to do with the docstore= assignment in PrevNext, but I'm not sure where I'm going wrong.

VectorStore = Weaviate
The docstore is disabled with vector db integrations, to simplify storage.

So with Weaviate, it will be empty, and the postprocessor fails.
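
For illustration (a hypothetical check, not from the thread), you can see the empty docstore directly, assuming the setup above:

Plain Text
# an index rebuilt purely from a vector store has an empty local docstore,
# so PrevNextNodePostprocessor has no neighbouring nodes to fetch
index = VectorStoreIndex.from_vector_store(
    vector_store, service_context=service_context
)
print(len(index.docstore.docs))  # prints 0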
Thanks Logan - I have the docstore in Redis. Should I be integrating that into the index somehow? I'm only using it for ingestion right now.
Or should I abandon the PrevNext postprocessing completely if I'm using a Weaviate vector store / Redis docstore?
Wait - I think I see what I need to do. Will report back. You sent me on the right path, thank you.
Yea, what I was going to suggest was using docstore.add_documents(nodes) with the nodes that the ingestion pipeline outputs

Then, use that docstore in the prev/next thing
Well, the docstore is getting populated correctly as part of the ingestion pipeline. So I am left with a vector db and a document store.

When I go to spin up the chat engine, it spins up from what was stored previously.

Plain Text
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name=collection_name, text_key="content"
)

docstore = RedisDocumentStore.from_host_and_port(
    "localhost", 6379, namespace="MyDocuments"
)

storage_context = StorageContext.from_defaults(docstore=docstore)

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
)

prevnext = PrevNextNodePostprocessor(
    docstore=storage_context.docstore,
    num_nodes=1,
    mode="both",
)


I thought this would work, but it didn't quite work out.
The docstore is only adding the initial input documents. But for the prev/next lookup to work, we need to add the output nodes too.

Plain Text
docstore = RedisDocumentStore.from_host_and_port(
  "localhost", 6379, namespace="MyDocuments"
)
pipeline = IngestionPipeline(..., docstore=docstore)

nodes = pipeline.run(documents=documents)

# remove embeddings to save space, add to docstore
for node in nodes:
  node.embedding = None
docstore.add_documents(nodes)

prevnext = PrevNextNodePostprocessor(docstore=docstore, ...)
Then it will work, I think.
It seems that the missing piece is that I need the IndexStore along with the DocStore.

Unfortunately, my IngestionPipeline isn't accepting an index_store, although it handles a docstore, which is what I've been doing.

Referring to
https://docs.llamaindex.ai/en/stable/examples/docstore/RedisDocstoreIndexStoreDemo.html#add-to-docstore
in your example: wouldn't the documents be added multiple times to the docstore?

Once as part of the pipeline with
IngestionPipeline(..., docstore=docstore)

and another time with
docstore.add_documents(nodes)
So I got all of it working. Still not sure whether it's possible to add an index_store to the ingestion pipeline just like the vector/doc store.

I feel like I'm duplicating my docstore documents by doing it outside of the ingestion.
You are kind of duplicating, but tbh it's not a huge deal. The ingestion pipeline does not accept an index store, because it's not an index.
Here's a demo with an alternative. But it's kind of annoying af 😅 https://github.com/run-llama/llama_index/issues/8832#issuecomment-1805969818
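
For reference, here is a sketch of the pattern from that demo (the hosts and namespace below are assumptions carried over from earlier in the thread): wire both a Redis docstore and a Redis index store into one StorageContext, add the pipeline's output nodes, then build the index against that context.

Plain Text
from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.docstore import RedisDocumentStore
from llama_index.storage.index_store import RedisIndexStore

# one Redis instance backs both stores
storage_context = StorageContext.from_defaults(
    docstore=RedisDocumentStore.from_host_and_port(
        "localhost", 6379, namespace="MyDocuments"
    ),
    index_store=RedisIndexStore.from_host_and_port(
        "localhost", 6379, namespace="MyDocuments"
    ),
)

# add the pipeline's output nodes, then build the index over the same context
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(nodes, storage_context=storage_context)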
Well, this does seem to be my exact issue. This is helpful - I'm switching my ingestion to work only off of the file watcher and process one file through the pipeline at a time to prevent duplicates.

Using inotifywait, then passing the filename into the ingestion Python script with the filepath as a param:
Plain Text
inotifywait -m -r -e create --format '%w%f' "$DIRECTORY_TO_MONITOR" | while read FILE
do
    # Check if the path is a regular file (not a directory)
    if [ -f "$FILE" ]; then
        echo "New file added: $FILE"
        # Pass the file path to the Python script
        python "$PYTHON_SCRIPT" "$FILE"
    else
        echo "Ignored directory change: $FILE"
    fi
done
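
A hypothetical companion script for the watcher (the script name and the pipeline config are assumptions; adapt them to the pipeline above) would take the file path as its first argument and repeat the node-adding step from earlier:

Plain Text
import sys

from llama_index import SimpleDirectoryReader
from llama_index.ingestion import IngestionPipeline
from llama_index.storage.docstore import RedisDocumentStore

# same docstore the ingestion pipeline uses for dedup
docstore = RedisDocumentStore.from_host_and_port(
    "localhost", 6379, namespace="MyDocuments"
)
pipeline = IngestionPipeline(docstore=docstore)  # transformations omitted for brevity

# the inotifywait loop passes the new file's path as the first argument
filepath = sys.argv[1]
documents = SimpleDirectoryReader(input_files=[filepath]).load_data()
nodes = pipeline.run(documents=documents)

# as before: drop embeddings to save space, then add the output nodes
# so PrevNextNodePostprocessor can look them up
for node in nodes:
    node.embedding = None
docstore.add_documents(nodes)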