
Updated 11 months ago

I got this `Error: Vector store

I got this Error: Vector store integrations that store text in the vector store are not supported by ref_doc_info yet. does Qdrant not support this?
40 comments
nope, since a vector store integration keeps all the node info in the vector store itself (it's quite difficult to support that feature without using a docstore)
doesn't it also keep a docstore for all the text?
Not by default, when using a vector store integration
You can override this, but then you need to store that data to disk or somewhere, which complicates storage
Plain Text
index = VectorStoreIndex(..., store_nodes_override=True)

ref_doc_ids = list(self._index_struct.doc_id_to_summary_id.keys())

all_ref_doc_info = {}
for ref_doc_id in ref_doc_ids:
    ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
    if not ref_doc_info:
        continue

    all_ref_doc_info[ref_doc_id] = ref_doc_info
basically I realized that if we insert a doc multiple times, it just stores every copy ... wish it would avoid doing that instead ... any suggestions?
(that's when I started checking ref_doc_info ... wish I could avoid it altogether)
Maybe use the IngestionPipeline with a docstore and vector store attached, to avoid duplicates?
off to docs land
Plain Text
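# Import paths assume llama-index >= 0.10; older releases use different paths.
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

vector_store = QdrantVectorStore(...)
docstore = SimpleDocumentStore()

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=docstore,  # pass the instance created above, so persist() below saves what the pipeline recorded
    vector_store=vector_store,
)

# Nodes are embedded and written to the attached vector store
pipeline.run(documents=documents)

# Save the docstore so later runs can detect duplicates
docstore.persist("docstore.json")

# On the next startup, reload it instead of creating an empty one
docstore = SimpleDocumentStore.from_persist_path("docstore.json")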
thx 👍🏼
IngestionPipeline doesn't solve ref_doc_info ... any alts ... wish I don't hv to manage that info on my own
I wish index.update() and index.refresh() worked in Qdrant ... that could solve the problem
the ingestion pipeline already handles the deduplication without it. Every time you run the pipeline with a docstore + vector store attached, it handles upserts without the need for refresh or update
If you absolutely need ref doc info, add the output nodes to the docstore as well
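The upsert idea described above can be sketched without any library. This is only an illustration of the general technique (remember a content hash per doc_id, skip unchanged docs, reprocess changed ones); `TinyDocstore` and `filter_for_upsert` are made-up names, and this is not a copy of how IngestionPipeline is actually implemented.

```python
import hashlib


class TinyDocstore:
    """Hypothetical stand-in for a docstore: maps doc_id -> content hash."""

    def __init__(self):
        self.hashes = {}


def content_hash(text: str) -> str:
    """Hash the document text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_for_upsert(docs, docstore):
    """Return only the (doc_id, text) pairs that are new or have changed."""
    to_process = []
    for doc_id, text in docs:
        new_hash = content_hash(text)
        if docstore.hashes.get(doc_id) == new_hash:
            continue  # same id and same content: nothing to do
        docstore.hashes[doc_id] = new_hash  # record new or updated content
        to_process.append((doc_id, text))
    return to_process


store = TinyDocstore()
first_run = filter_for_upsert([("a", "hello"), ("b", "world")], store)
second_run = filter_for_upsert([("a", "hello"), ("b", "world!")], store)
print(first_run)   # both docs are new
print(second_run)  # only "b" changed
```

In this sketch the second run only returns the changed document, which is why a persisted docstore has to be reloaded between runs for the check to work.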
hey, that's cool ... thx for the info
so I create the pipeline before creating the index ...
then do pipeline.run before index.insert ...
hope that's fine
If the vector_store is attached, you don't need index.insert -- it will already be in the vector store
I'm assuming this is what u r saying ...
Plain Text
client = ...
vector_store = ...
# storage_context = ...
# index = ...

pipeline = IngestionPipeline(...)
docstore.persist("docstore.json")
docstore = SimpleDocumentStore.from_persist_path("docstore.json")
... when we encounter each doc do ...
Plain Text
doc1 = ...
pipeline.run([doc1])
doc2 = ...
pipeline.run([doc2])
btw, one of the examples had done an insert ...
Plain Text
for document in documents_jerry:
    document.metadata["user"] = "Jerry"

nodes = pipeline.run(documents=documents_jerry)
index.insert_nodes(nodes)
... so just wondering why they followed it up w/ insert
Probably because the example didn't attach a vector store?
Yea this makes sense 👍
I understand that we've created a docstore using this pipeline ... so it'd help upsert docs w/ similar doc_id when put in the store ... am I right?
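An upsert keyed on doc_id can only match if the same source always gets the same id. A minimal sketch of deriving a stable id from a source path follows; `NAMESPACE` and `stable_doc_id` are hypothetical names, and the assumption is that you then assign the result to the document yourself (for example by setting `document.doc_id`) before calling pipeline.run.

```python
import uuid

# Hypothetical namespace for this corpus; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/my-corpus")


def stable_doc_id(source_path: str) -> str:
    """Derive a deterministic id, so re-ingesting a path reuses its id."""
    return str(uuid.uuid5(NAMESPACE, source_path))


id_first = stable_doc_id("reports/2023/q4.pdf")
id_again = stable_doc_id("reports/2023/q4.pdf")
id_other = stable_doc_id("reports/2024/q1.pdf")
print(id_first == id_again)  # same path gives the same id
print(id_first == id_other)  # different path gives a different id
```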
just curious if one can customize these pipelines too ... for instance if I'd like to incorporate some stuff that I'd like to hv within
you can customize the sequence of transformations! And even implement custom transformations
just curious, shouldn't we be building this into the default indexing (pipeline), so one doesn't have to create it specifically, irrespective of db? I'm sure you had a reason for not having it.
it creates extra storage that has to be persisted (the simple docstore saves to disk, but there's also redis, mongodb, postgres). Rather than hiding that fact from the user, it's presented upfront.
very true ... thx for that update 👍🏼
btw, I currently do quite some processing before I extract a doc and do pipeline.run ... could it help if I do a quick check of its doc_id in the store first ... any suggestions?
You could probably add your processing to the ingestion pipeline itself? Just by implementing a custom transformation
any guidelines for creating a custom transformation?
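A minimal sketch of a custom transformation, assuming llama-index >= 0.10 where the base class is `TransformComponent` in `llama_index.core.schema`: a transformation is a callable that takes a list of nodes and returns a list of nodes. `WhitespaceCleaner` is a hypothetical example, and the `ImportError` fallback plus the `SimpleNamespace` demo nodes exist only so the sketch runs without the library installed.

```python
import re
from types import SimpleNamespace

try:
    # Assumes llama-index >= 0.10 import layout.
    from llama_index.core.schema import TransformComponent
except ImportError:
    # Fallback so the sketch still runs without the library installed.
    TransformComponent = object


class WhitespaceCleaner(TransformComponent):
    """Hypothetical transformation: collapse runs of whitespace in each node."""

    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = re.sub(r"\s+", " ", node.text).strip()
        return nodes


# Stand-in nodes for the demo; real pipelines pass library node objects.
demo_nodes = [SimpleNamespace(text="  hello \n\n  world  ")]
cleaned = WhitespaceCleaner()(demo_nodes)
print(repr(cleaned[0].text))
```

An instance would then go into the pipeline's `transformations` list alongside `SentenceSplitter()` and the embedding model, in whatever order the processing requires.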
btw, not sure if this is working when we add the workers ... pipeline.run(documents=[...], num_workers=4)
does it raise some error? I remember getting an error locally myself too, been meaning to fix that
This could help perhaps ...
Error: cannot pickle 'builtins.CoreBPE' object
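A plausible reading of this error, stated as an assumption: `CoreBPE` looks like the tokenizer object from tiktoken, and sending work to worker processes requires pickling whatever the transformations hold. The sketch below reproduces the same class of failure with a `threading.Lock` as a stand-in for the unpicklable handle, and shows one common workaround (drop the handle when pickling, rebuild it on load). `Tokenizer` and `LazyTokenizer` are hypothetical and are not part of LlamaIndex.

```python
import pickle
import threading


class Tokenizer:
    """Hypothetical component holding a handle that cannot be pickled."""

    def __init__(self):
        self._handle = threading.Lock()  # stand-in for an object like CoreBPE

    def count(self, text):
        with self._handle:
            return len(text.split())


class LazyTokenizer(Tokenizer):
    """Same component, but it drops the handle when pickled and rebuilds it."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_handle"] = None  # do not try to serialize the handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._handle = threading.Lock()  # recreate it in the receiving process


try:
    pickle.dumps(Tokenizer())
    plain_error = None
except TypeError as exc:
    plain_error = str(exc)

clone = pickle.loads(pickle.dumps(LazyTokenizer()))
print(plain_error)
print(clone.count("a b c"))
```

Until the component itself is fixed, calling pipeline.run without `num_workers` avoids the pickling step.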