codester

Docstore

I'm messing around with Node PostProcessors. Reranking is working fine, and I thought I would add a second stage using PrevNext.

Plain Text
    index = VectorStoreIndex.from_vector_store(
        vector_store, service_context=service_context
    )


    prevnext = PrevNextNodePostprocessor(
        docstore=index.docstore,
        num_nodes=1,
        mode="previous",
    )


    rerank = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3)



    chat_engine = index.as_chat_engine(
        chat_mode="context",
        memory=memory,
        verbose=True,
        streaming=STREAMING,
        service_context=service_context,
        similarity_top_k=10,
        node_postprocessors=[rerank, prevnext],
        ....



When it gets to the PrevNext post processor, it fails with "doc_id 0870c4a7-f226-4988-bc82-1702652d8f7e not found."

I have a feeling it has something to do with the docstore= assignment in prevnext, but I'm not sure where I'm going wrong.

VectorStore = Weaviate
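A toy illustration of what may be going on (plain Python, not LlamaIndex code; `ToyDocstore` and `prev_node` are made-up names for illustration): a PrevNext-style postprocessor looks up each retrieved node's neighbour by doc_id in the docstore you hand it. If that docstore was never populated with the nodes, which can happen when an index is rebuilt on top of an existing vector store so only the vector store holds the data, every neighbour lookup fails with a "doc_id ... not found" error like the one above.

```python
# Toy sketch of prev/next lookup against a docstore (NOT LlamaIndex API).
class ToyDocstore:
    def __init__(self):
        self.nodes = {}  # doc_id -> node dict

    def add(self, node):
        self.nodes[node["id"]] = node

    def get(self, doc_id):
        if doc_id not in self.nodes:
            # analogous to the "doc_id ... not found." failure
            raise KeyError(f"doc_id {doc_id} not found.")
        return self.nodes[doc_id]


def prev_node(docstore, node):
    # follow the "previous" relationship, as mode="previous" would
    return docstore.get(node["prev_id"])


# Docstore populated at ingestion time: neighbour lookups succeed.
ds = ToyDocstore()
ds.add({"id": "n1", "prev_id": None, "text": "first chunk"})
ds.add({"id": "n2", "prev_id": "n1", "text": "second chunk"})
assert prev_node(ds, ds.get("n2"))["text"] == "first chunk"

# Empty docstore (index rebuilt from the vector store alone):
# the same lookup fails, even though the vector store has the node.
empty = ToyDocstore()
try:
    empty.get("n2")
except KeyError as e:
    print(e)  # prints the "doc_id ... not found." message
```

If this is the cause, the fix would be to pass PrevNext a docstore that was actually persisted/populated with the nodes at ingestion time, rather than the empty default attached to the rebuilt index.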
15 comments
codester
Documents

Given an ingestion pipeline with Vector + DocStore + IngestionCache, with DocstoreStrategy=UPSERTS, over Documents in a recursive directory:

If I run this same ingestion pipeline with Documents = 1 single file, what would occur?

Will the other docs be deleted? (I know UPSERTS is usually just UPDATE+INSERT, but just checking.)

If the single document file existed in the full processing run, will it be recognized and only perform the update?

---------------------

Similar question: if I wanted to run a completely different source of documents, like YouTube transcripts, into the same Vector Collection, would both ingestion pipelines be able to work without stepping on each other's embeddings?
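For intuition, here is a toy sketch (plain Python, not the LlamaIndex implementation; `upsert` and the hash-keyed `store` are illustrative stand-ins) of how an upsert-style docstore strategy typically dedupes: each document is keyed by its id with a content hash, unchanged docs are skipped, changed docs are updated, and docs absent from the current run are simply left alone rather than deleted (deletion is usually a separate, explicit strategy).

```python
# Toy sketch of UPSERT-style dedup keyed on doc_id + content hash
# (NOT the actual library implementation).
import hashlib

store = {}  # doc_id -> content hash


def upsert(doc_id, text):
    h = hashlib.sha256(text.encode()).hexdigest()
    if store.get(doc_id) == h:
        return "skipped"  # unchanged content: no re-processing
    action = "updated" if doc_id in store else "inserted"
    store[doc_id] = h
    return action


# Full run over a directory of documents.
assert upsert("a.txt", "alpha") == "inserted"
assert upsert("b.txt", "beta") == "inserted"

# Later run over a single file: only that doc is touched.
assert upsert("a.txt", "alpha") == "skipped"    # unchanged -> no-op
assert upsert("a.txt", "alpha v2") == "updated" # changed -> update
assert "b.txt" in store                         # other docs NOT deleted
```

Under this model, two pipelines feeding the same collection would only collide if their documents produced the same ids; distinct sources with distinct ids would coexist.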
10 comments
I'm having an issue getting the ingestion pipeline to work with Weaviate + Redis Cache (so that I can ingest new documents later).

I had to add index = VectorStoreIndex(nodes, storage_context=storage_context) to get it to load into Weaviate, but now it seems that the cache is not taken into account. Without the index line, it seemed like it was processing data but never loading.

When using Chroma, I was able to get the ingestion pipeline + cache to work (everything the same minus the index = line).

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SimpleNodeParser(),
        # SentenceSplitter(chunk_size=512, chunk_overlap=20),
        # TitleExtractor(nodes=5),
        # SummaryExtractor(summaries=["prev", "self"]),
        # KeywordExtractor(keywords=10),
        # OpenAIEmbedding(),
    ],
    vector_store=vector_store,
    cache=ingest_cache,
)

nodes = pipeline.run(documents=documents, storage_context=storage_context)

index = VectorStoreIndex(nodes, storage_context=storage_context)
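For reference, a toy sketch of how a transformation cache in an ingestion pipeline typically behaves (plain Python, not LlamaIndex; `run_with_cache` and `expensive_transform` are illustrative names): the cache key is derived from the input plus the transformation, so a re-run over unchanged input is a cache hit and the expensive step is skipped. If the cache isn't being consulted, each run pays the full transformation cost again.

```python
# Toy sketch of a hash-keyed transformation cache (NOT the library's
# implementation): unchanged input -> cache hit -> transform skipped.
import hashlib

cache = {}
calls = {"n": 0}  # counts how often the expensive step actually runs


def expensive_transform(text):
    calls["n"] += 1
    return text.upper()


def run_with_cache(text, transform_name="upper"):
    key = hashlib.sha256((transform_name + text).encode()).hexdigest()
    if key in cache:
        return cache[key]  # hit: transformation skipped entirely
    out = expensive_transform(text)
    cache[key] = out
    return out


assert run_with_cache("hello") == "HELLO"
assert run_with_cache("hello") == "HELLO"  # second run: cache hit
assert calls["n"] == 1                     # transform ran only once
```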
20 comments