nope, since the vector store integration stores all the info in the vector store itself (it's quite difficult to support that feature without using a docstore)
doesn't it also maintain a docstore for all the text?
Not by default when using a vector store integration.
You can override this, but then you need to store that data to disk or somewhere, which complicates storage:
index = VectorStoreIndex(..., store_nodes_override=True)
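e.g. something like this, as a quick sketch (assuming you already have `nodes` and a `storage_context` with your vector store attached):

from llama_index.core import VectorStoreIndex

# keep nodes in the docstore even though a vector store is attached
index = VectorStoreIndex(
    nodes,
    storage_context=storage_context,
    store_nodes_override=True,
)

# now there's docstore data that has to be persisted separately
index.storage_context.persist(persist_dir="./storage")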
ref_doc_ids = list(self._index_struct.doc_id_to_summary_id.keys())
all_ref_doc_info = {}
for ref_doc_id in ref_doc_ids:
    ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
    if not ref_doc_info:
        continue
    all_ref_doc_info[ref_doc_id] = ref_doc_info
basically I realized that if we insert a doc multiple times, it just adds them all ... wish it could avoid doing that ... any suggestions?
(that's when I started checking ref_doc_info
... wish I could avoid it)
Maybe use the IngestionPipeline with a docstore and vector store attached, to avoid duplicates?
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

vector_store = QdrantVectorStore(...)
docstore = SimpleDocumentStore()

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=docstore,  # reuse the instance above so it can be persisted later
    vector_store=vector_store,
)

pipeline.run(documents=documents)

# save and reload
docstore.persist("docstore.json")
docstore = SimpleDocumentStore.from_persist_path("docstore.json")
IngestionPipeline doesn't solve ref_doc_info
... any alternatives? ... wish I didn't have to manage that info on my own
I wish index.update()
and index.refresh()
worked in Qdrant ... that could solve the problem
The ingestion pipeline is already handling the deduplication without those. Every time you run the pipeline with a docstore + vector store, it handles upserts without the need for refresh() or update()
If you absolutely need ref doc info, add the output nodes to the docstore as well
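e.g. something like this, as a quick sketch (reusing the docstore from above):

nodes = pipeline.run(documents=documents)
# store the transformed nodes too, so ref_doc_info can be resolved later
docstore.add_documents(nodes)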
hey, that's cool ... thanks for the info
so I create the pipeline before creating the index ...
then do pipeline.run
before index.insert
...
hope that's fine
If the vector_store is attached, you don't need index.insert
-- it will already be in the vector store
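For example, you can build the index straight on top of the vector store afterwards (a quick sketch):

from llama_index.core import VectorStoreIndex

# the data is already in the vector store, so no insert needed
index = VectorStoreIndex.from_vector_store(vector_store)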
I'm assuming this is what you're saying ...
client = ...
vector_store = ...
# storage_context = ...
# index = ...
pipeline = IngestionPipeline(...)
docstore.persist("docstore.json") docstore = SimpleDocumentStore.from_persist_path("docstore.json")
... when we encounter each doc do ...
doc1 = ...
pipeline.run([doc1])
doc2 = ...
pipeline.run([doc2])
btw, one of the examples had done an insert ...
for document in documents_jerry:
    document.metadata["user"] = "Jerry"

nodes = pipeline.run(documents=documents_jerry)
index.insert_nodes(nodes)
... so just wondering why they followed it up with an insert
Probably because the example didn't attach a vector store?
Yea this makes sense
I understand that we've created a docstore
using this pipeline ... so it'd help upsert docs with the same doc_id
when put in the store ... am I right?
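i.e., I'm assuming something like this would get deduplicated (a quick sketch, the doc id is made up):

from llama_index.core import Document

doc = Document(text="hello world", id_="doc-1")
pipeline.run(documents=[doc])  # first run: inserted

# same id_, unchanged content: skipped as a duplicate
pipeline.run(documents=[doc])

# same id_, changed content: upserted
pipeline.run(documents=[Document(text="hello world, revised", id_="doc-1")])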
just curious if one can customize these pipelines too ... for instance if I'd like to incorporate some of my own processing within it
you can customize the sequence of transformations! And even implement custom transformations
just curious, shouldn't we be building this into the default indexing pipeline, so one doesn't have to create it explicitly, irrespective of the db? I'm sure you had a reason for not having it.
It creates extra storage that has to be persisted (the simple docstore saves to disk, but there's also redis, mongodb, and postgres). Rather than hiding that fact from the user, it's presented upfront.
very true ... thanks for that update
btw, I currently do quite a bit of processing before I extract a doc and call pipeline.run
... could I avoid that work with a quick check for its doc_id
in the store first ... any suggestions?
You could probably add your processing to the ingestion pipeline itself? Just by implementing a custom transformation
any guidelines for creating a custom transformation?
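Roughly: subclass TransformComponent and implement __call__. A quick sketch (the cleaning logic is just a placeholder):

from llama_index.core.schema import TransformComponent

class TextCleaner(TransformComponent):
    # receives the nodes flowing through the pipeline, returns them transformed
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            # placeholder processing: strip stray whitespace
            node.text = node.text.strip()
        return nodes

pipeline = IngestionPipeline(
    transformations=[
        TextCleaner(),  # runs before splitting/embedding
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=docstore,
    vector_store=vector_store,
)

That way your custom processing rides along in the same pipeline.run call as the dedup logic.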
btw, not sure if this is working when we add the workers ... pipeline.run(documents=[...], num_workers=4)
does it raise some error? I remember getting an error locally myself too, been meaning to fix that
Perhaps this could help ...
Error: cannot pickle 'builtins.CoreBPE' object