Docstore

I'm messing around with node postprocessors. Reranking is working fine, and I thought I would add in a second stage trying PrevNext.

Plain Text
index = VectorStoreIndex.from_vector_store(
    vector_store, service_context=service_context
)

prevnext = PrevNextNodePostprocessor(
    docstore=index.docstore,
    num_nodes=1,
    mode="previous",
)

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    verbose=True,
    streaming=STREAMING,
    service_context=service_context,
    similarity_top_k=10,
    node_postprocessors=[rerank, prevnext],
    ...
)



When it gets to the PrevNext post processor it fails with "doc_id 0870c4a7-f226-4988-bc82-1702652d8f7e not found."

I have a feeling it has something to do with the docstore= assignment in PrevNext, but I'm not sure where I'm going wrong.

VectorStore = Weaviate
The docstore is disabled with vector db integrations, to simplify storage.

So with Weaviate, it will be empty, and the postprocessor fails.
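
For illustration (a hypothetical check, not from the thread), you can see the empty docstore directly, assuming the setup above:

Plain Text
# an index rebuilt purely from a vector store has an empty local docstore,
# so PrevNextNodePostprocessor has no neighbouring nodes to fetch
index = VectorStoreIndex.from_vector_store(
    vector_store, service_context=service_context
)
print(len(index.docstore.docs))  # prints 0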
Thanks Logan - I have the docstore in Redis. Should I be integrating that into the index somehow? I'm only using it for ingestion right now.
Or should I abandon the PrevNext postprocessing completely if I'm using a Weaviate vector store / Redis docstore?
Wait - I think I see what I need to do. Will report back. You sent me on the right path, thank you.
Yea, what I was going to suggest was using docstore.add_documents(nodes) with the nodes that the ingestion pipeline outputs

Then, use that docstore in the prev/next thing
Well, the docstore is getting populated correctly as part of the ingestion pipeline. So I am left with a vector db and a document store.

When I go to spin up the chat engine, it spins up from what was stored previously.

Plain Text
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name=collection_name, text_key="content"
)

docstore = RedisDocumentStore.from_host_and_port(
    "localhost", 6379, namespace="MyDocuments"
)

storage_context = StorageContext.from_defaults(docstore=docstore)

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
)

prevnext = PrevNextNodePostprocessor(
    docstore=storage_context.docstore,
    num_nodes=1,
    mode="both",
)


I thought this would work, but it didn't quite work out.
The docstore is only adding the initial input documents. But for the prev/next lookup to work, we need to add the output nodes too.

Plain Text
docstore = RedisDocumentStore.from_host_and_port(
  "localhost", 6379, namespace="MyDocuments"
)
pipeline = IngestionPipeline(..., docstore=docstore)

nodes = pipeline.run(documents=documents)

# remove embeddings to save space, add to docstore
for node in nodes:
  node.embedding = None
docstore.add_documents(nodes)

prevnext = PrevNextNodePostprocessor(docstore=docstore, ...)
Then it will work, I think.
It seems that the missing piece is that I need the IndexStore along with the DocStore.

Unfortunately, my IngestionPipeline isn't accepting an index_store, although it handles a docstore, which is what I've been doing.

Referring to
https://docs.llamaindex.ai/en/stable/examples/docstore/RedisDocstoreIndexStoreDemo.html#add-to-docstore
in your example: wouldn't the documents be added multiple times to the docstore?

Once as part of the pipeline with
IngestionPipeline(..., docstore=docstore)

and another time with
docstore.add_documents(nodes)
So I got all of it working. Still not sure whether it's possible to add an index_store to the ingestion pipeline just like the vector/doc store.

I feel like I'm duplicating my docstore documents by doing it outside of the ingestion.
You are kind of duplicating, but tbh it's not a huge deal. The ingestion pipeline does not accept an index store, because it's not an index.
Here's a demo with an alternative. But it's kind of annoying af 😅 https://github.com/run-llama/llama_index/issues/8832#issuecomment-1805969818
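
For reference, here is a sketch of the pattern from that demo (the hosts and namespace below are assumptions carried over from earlier in the thread): wire both a Redis docstore and a Redis index store into one StorageContext, add the pipeline's output nodes, then build the index against that context.

Plain Text
from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.docstore import RedisDocumentStore
from llama_index.storage.index_store import RedisIndexStore

# one Redis instance backs both stores
storage_context = StorageContext.from_defaults(
    docstore=RedisDocumentStore.from_host_and_port(
        "localhost", 6379, namespace="MyDocuments"
    ),
    index_store=RedisIndexStore.from_host_and_port(
        "localhost", 6379, namespace="MyDocuments"
    ),
)

# add the pipeline's output nodes, then build the index over the same context
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(nodes, storage_context=storage_context)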
Well, this does seem to be my exact issue. This is helpful - I'm switching my ingestion to work only off of the file watcher and process one file through the pipeline at a time to prevent duplicates.

Using inotifywait, then passing the filename into the ingestion Python script with the filepath as a param:
Plain Text
inotifywait -m -r -e create --format '%w%f' "$DIRECTORY_TO_MONITOR" | while read FILE
do
    # Check if the path is a regular file (not a directory)
    if [ -f "$FILE" ]; then
        echo "New file added: $FILE"
        # Pass the file path to the Python script
        python "$PYTHON_SCRIPT" "$FILE"
    else
        echo "Ignored directory change: $FILE"
    fi
done
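
A hypothetical companion script for the watcher (the script name and the pipeline config are assumptions; adapt them to the pipeline above) would take the file path as its first argument and repeat the node-adding step from earlier:

Plain Text
import sys

from llama_index import SimpleDirectoryReader
from llama_index.ingestion import IngestionPipeline
from llama_index.storage.docstore import RedisDocumentStore

# same docstore the ingestion pipeline uses for dedup
docstore = RedisDocumentStore.from_host_and_port(
    "localhost", 6379, namespace="MyDocuments"
)
pipeline = IngestionPipeline(docstore=docstore)  # transformations omitted for brevity

# the inotifywait loop passes the new file's path as the first argument
filepath = sys.argv[1]
documents = SimpleDirectoryReader(input_files=[filepath]).load_data()
nodes = pipeline.run(documents=documents)

# as before: drop embeddings to save space, then add the output nodes
# so PrevNextNodePostprocessor can look them up
for node in nodes:
    node.embedding = None
docstore.add_documents(nodes)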