I'm not sure what you mean?
If you run a document through a pipeline, it will create nodes
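For example (a minimal, self-contained sketch):
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# running a document through a pipeline returns the transformed nodes
pipeline = IngestionPipeline(transformations=[SentenceSplitter()])
nodes = pipeline.run(documents=[Document(text="hello world")])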
I thought it wouldn't do that if I already have the same nodes in the vector database
You should attach a docstore to the pipeline if you want to track upserts/duplicates
Would you mind showing me how?
I thought the vector store was enough
I don't have a docstore, just a vector store?
The vector store is not enough. It's not tracking which documents have been inserted + their hashes
you need a docstore for this
in addition to a vector store
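Roughly like this (a minimal sketch; SimpleVectorStore is just a stand-in for whatever vector store you actually use):
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    docstore=SimpleDocumentStore(),    # records doc IDs + hashes, enables upserts
    vector_store=SimpleVectorStore(),  # stand-in; use your real vector store here
)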
So I have to persist both the vector store and the document store?
self.docstore.persist('docstore_db5')
if os.path.exists('docstore_db5'):
    self.docstore = SimpleDocumentStore.from_persist_path('docstore_db5')
else:
    self.docstore = SimpleDocumentStore()
this does not work :/
works fine for me?
>>> from llama_index.core.storage.docstore import SimpleDocumentStore
>>> docstore = SimpleDocumentStore()
>>> docstore.persist("./docstore_db5.json")
>>> docstore = SimpleDocumentStore.from_persist_path("./docstore_db5.json")
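One thing to watch: persist at the end of the run, after ingesting, not before loading. A minimal load-or-create sketch:
import os
from llama_index.core.storage.docstore import SimpleDocumentStore

path = "./docstore_db5.json"
if os.path.exists(path):
    docstore = SimpleDocumentStore.from_persist_path(path)
else:
    docstore = SimpleDocumentStore()
# ... attach the docstore to the pipeline and run it ...
docstore.persist(path)  # so the next run sees what was ingested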
Well, the docstore works, but the ingestion pipeline still inserts new docs (the same ones).
It's dependent on the documents having the same doc ID
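e.g. (a minimal sketch; the doc_id string is arbitrary, it just has to be stable across runs):
from llama_index.core import Document

# same doc_id every run -> the pipeline can match hashes and upsert
doc = Document(text="some text", doc_id="qa")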
How can I set the doc ID? I don't change anything, I just run the same program twice.
This is my code (apart from the part you sent):
pipelines = {
    "QA": IngestionPipeline(
        transformations=[
            SentenceSplitter(paragraph_separator="\n\n\n", chunk_size=300, chunk_overlap=20),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        docstore=self.docstore,
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "Klubista": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        docstore=self.docstore,
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "PrevadzkovyPoriadok": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        docstore=self.docstore,
        cache=IngestionCache(),
    ),
    "OtherDocs": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        docstore=self.docstore,
        cache=IngestionCache(),
    ),
}
qa_text = docx2txt.process("data/qa.docx")
prevadzkovy_txt = Path("data/prevadzkovy.txt").read_text()
klubista_txt = Path("data/klubista.txt").read_text()
qa_doc = Document(text=qa_text)
prevadzkovy_txt_doc = Document(text=prevadzkovy_txt)
klubista_txt_doc = Document(text=klubista_txt)
await pipelines["QA"].arun(documents=[qa_doc])
await pipelines["Klubista"].arun(documents=[klubista_txt_doc])
await pipelines["PrevadzkovyPoriadok"].arun(documents=[prevadzkovy_txt_doc])
for doc in os.listdir("data/other_docs"):
    print(doc)
    txt = docx2txt.process(f"data/other_docs/{doc}")
    txt_doc = Document(text=txt)
    await pipelines["OtherDocs"].arun(documents=[txt_doc])
self.docstore.persist('docstore_db5.json')
They have different doc IDs... I mean in the JSON: "131dfe02-3ea9-45a0-ad03-cb3b3a07b1cf": {
and "2d075cb4-25e5-4405-bb18-60f1193c7bee": {
Why though?
Ah, maybe because I am processing the text and then using the pipeline, instead of loading from the doc directly
Unfortunately that didn't help
Well, either there is a bug or I am completely lost.
It loads the persist dir, but the pipeline goes through anyway.
Okay, I got it... I didn't include filename_as_id=True ;-)... I will maybe update the IngestionPipeline docs, because that's misleading
The docs I linked specifically emphasize needing a consistent doc ID. They also use that argument.
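For reference, roughly (a minimal sketch):
from llama_index.core import SimpleDirectoryReader

# filename_as_id=True derives each Document's ID from its file path,
# so the same file gets the same ID on every run
docs = SimpleDirectoryReader("data/other_docs", filename_as_id=True).load_data()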
Sorry, then I am overworked :)