nope, since the vector store integration stores all the info in the vector store itself (it's quite difficult to support that feature without using a docstore)
doesn't it also maintain a docstore for all the text?
Not by default when using a vector store integration.
You can override this, but then you need to store that data to disk or somewhere, which complicates storage:
index = VectorStoreIndex(..., store_nodes_override=True)
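e.g. something like this, as a quick sketch (assuming you already have `nodes` and a `storage_context` with your vector store attached):

from llama_index.core import VectorStoreIndex

# keep nodes in the docstore even though a vector store is attached
index = VectorStoreIndex(
    nodes,
    storage_context=storage_context,
    store_nodes_override=True,
)

# now there's docstore data that has to be persisted separately
index.storage_context.persist(persist_dir="./storage")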
ref_doc_ids = list(self._index_struct.doc_id_to_summary_id.keys())
all_ref_doc_info = {}
for ref_doc_id in ref_doc_ids:
    ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
    if not ref_doc_info:
        continue
    all_ref_doc_info[ref_doc_id] = ref_doc_info
basically I realized that if we insert a doc multiple times, it just adds them all ... wish it could avoid doing that ... any suggestions?
(that's when I started checking ref_doc_info
... wish I could avoid it)
Maybe use the IngestionPipeline with a docstore and vector store attached, to avoid duplicates?
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

vector_store = QdrantVectorStore(...)
docstore = SimpleDocumentStore()

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=docstore,  # reuse the instance above so it can be persisted later
    vector_store=vector_store,
)

pipeline.run(documents=documents)

# save and reload
docstore.persist("docstore.json")
docstore = SimpleDocumentStore.from_persist_path("docstore.json")
IngestionPipeline doesn't solve ref_doc_info
... any alternatives? ... wish I didn't have to manage that info on my own
I wish index.update()
and index.refresh()
worked in Qdrant ... that could solve the problem
The ingestion pipeline is already handling the deduplication without those. Every time you run the pipeline with a docstore + vector store, it handles upserts without the need for refresh() or update()
If you absolutely need ref doc info, add the output nodes to the docstore as well
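e.g. something like this, as a quick sketch (reusing the docstore from above):

nodes = pipeline.run(documents=documents)
# store the transformed nodes too, so ref_doc_info can be resolved later
docstore.add_documents(nodes)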
hey, that's cool ... thanks for the info
so I create the pipeline before creating the index ...
then do pipeline.run
before index.insert
...
hope that's fine
If the vector_store is attached, you don't need index.insert
-- it will already be in the vector store
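For example, you can build the index straight on top of the vector store afterwards (a quick sketch):

from llama_index.core import VectorStoreIndex

# the data is already in the vector store, so no insert needed
index = VectorStoreIndex.from_vector_store(vector_store)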
I'm assuming this is what you're saying ...
client = ...
vector_store = ...
# storage_context = ...
# index = ...
pipeline = IngestionPipeline(...)
docstore.persist("docstore.json") docstore = SimpleDocumentStore.from_persist_path("docstore.json")
... when we encounter each doc do ...
doc1 = ...
pipeline.run([doc1])
doc2 = ...
pipeline.run([doc2])
btw, one of the examples had done an insert ...
for document in documents_jerry:
    document.metadata["user"] = "Jerry"

nodes = pipeline.run(documents=documents_jerry)
index.insert_nodes(nodes)
... so just wondering why they followed it up with an insert
Probably because the example didn't attach a vector store?
Yea this makes sense
I understand that we've created a docstore
using this pipeline ... so it'd help upsert docs with the same doc_id
when put in the store ... am I right?
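i.e., I'm assuming something like this would get deduplicated (a quick sketch, the doc id is made up):

from llama_index.core import Document

doc = Document(text="hello world", id_="doc-1")
pipeline.run(documents=[doc])  # first run: inserted

# same id_, unchanged content: skipped as a duplicate
pipeline.run(documents=[doc])

# same id_, changed content: upserted
pipeline.run(documents=[Document(text="hello world, revised", id_="doc-1")])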
just curious if one can customize these pipelines too ... for instance if I'd like to incorporate some of my own processing within it
you can customize the sequence of transformations! And even implement custom transformations
just curious, shouldn't we be building this into the default indexing pipeline, so one doesn't have to create it explicitly, irrespective of the db? I'm sure you had a reason for not having it.
It creates extra storage that has to be persisted (the simple docstore saves to disk, but there's also redis, mongodb, and postgres). Rather than hiding that fact from the user, it's presented upfront.
very true ... thanks for that update
btw, I currently do quite a bit of processing before I extract a doc and call pipeline.run
... could I avoid that work with a quick check for its doc_id
in the store first ... any suggestions?
You could probably add your processing to the ingestion pipeline itself? Just by implementing a custom transformation
any guidelines for creating a custom transformation?
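Roughly: subclass TransformComponent and implement __call__. A quick sketch (the cleaning logic is just a placeholder):

from llama_index.core.schema import TransformComponent

class TextCleaner(TransformComponent):
    # receives the nodes flowing through the pipeline, returns them transformed
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            # placeholder processing: strip stray whitespace
            node.text = node.text.strip()
        return nodes

pipeline = IngestionPipeline(
    transformations=[
        TextCleaner(),  # runs before splitting/embedding
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=docstore,
    vector_store=vector_store,
)

That way your custom processing rides along in the same pipeline.run call as the dedup logic.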
btw, not sure if this is working when we add the workers ... pipeline.run(documents=[...], num_workers=4)
does it raise some error? I remember getting an error locally myself too, been meaning to fix that
Perhaps this could help ...
Error: cannot pickle 'builtins.CoreBPE' object