Hi guys, wondering one thing.

I have these pipelines; however, they insert new data every time they run (so after 3 runs with retrieval top_k = 3, all three retrieved chunks are the same text)...

Why?

pipelines = {
    "QA": IngestionPipeline(
        transformations=[
            SentenceSplitter(paragraph_separator="\n\n\n", chunk_size=300, chunk_overlap=20),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "Klubista": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "PrevadzkovyPoriadok": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "OtherDocs": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
}
I'm not sure what you mean?

If you run a document through a pipeline, it will create nodes πŸ‘€
So how do I avoid that?
I thought it wouldn't do it if I already have the same nodes in the vector database
You should attach a docstore to the pipeline if you want to track upserts/duplicates
Would you mind showing me how?
I thought the vector store was enough
I don't have a docstore, just a vector store?
The vector store is not enough. It's not tracking which documents have been inserted + their hashes
you need a docstore for this
in addition to a vector store
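Roughly like this, as a minimal sketch (DocstoreStrategy comes from llama_index.core.ingestion; if I remember right, UPSERTS is the default once both stores are attached, and SimpleVectorStore here just stands in for whatever vector store you actually use):

Python
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=400, chunk_overlap=50),
        OpenAIEmbedding(model="text-embedding-3-large"),
    ],
    docstore=SimpleDocumentStore(),  # records doc ids + content hashes of ingested docs
    vector_store=SimpleVectorStore(),  # swap in your real vector store
    docstore_strategy=DocstoreStrategy.UPSERTS,  # skip unchanged docs, re-run changed ones
)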
So I have to persist both the vector store and the document store?
Probably, yes
self.docstore.persist('docstore_db5')

if os.path.exists('docstore_db5'):
    self.docstore = SimpleDocumentStore.from_persist_path('docstore_db5')
else:
    self.docstore = SimpleDocumentStore()

This does not work :/
works fine for me?

Plain Text
>>> from llama_index.core.storage.docstore import SimpleDocumentStore
>>> docstore = SimpleDocumentStore()
>>> docstore.persist("./docstore_db5.json")
>>> docstore = SimpleDocumentStore.from_persist_path("./docstore_db5.json")
Well, the docstore works, but the ingestion pipeline still inserts new docs (the same ones).
It's dependent on the documents having the same doc id
Is that true for you?
How can I set the doc id? I don't change anything, I just run the same program twice.
The data is the same.
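One way, as a sketch (assuming you keep building Documents from raw text as in the code below): set id_ on each Document to something stable, e.g. the source path, so the hash check has a fixed id to compare against:

Python
from llama_index.core import Document

qa_text = "...contents of data/qa.docx..."  # however you load it
qa_doc = Document(text=qa_text)
qa_doc.id_ = "data/qa.docx"  # any string that stays the same across runs works as the id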
This is my code (apart from the part you sent):


pipelines = {
    "QA": IngestionPipeline(
        transformations=[
            SentenceSplitter(paragraph_separator="\n\n\n", chunk_size=300, chunk_overlap=20),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        docstore=self.docstore,
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "Klubista": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        docstore=self.docstore,
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "PrevadzkovyPoriadok": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        docstore=self.docstore,
        cache=IngestionCache(),
    ),
    "OtherDocs": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        docstore=self.docstore,
        cache=IngestionCache(),
    ),
}

qa_text = docx2txt.process("data/qa.docx")
prevadzkovy_txt = Path("data/prevadzkovy.txt").read_text()
klubista_txt = Path("data/klubista.txt").read_text()
qa_doc = Document(text=qa_text)
prevadzkovy_txt_doc = Document(text=prevadzkovy_txt)
klubista_txt_doc = Document(text=klubista_txt)

await pipelines["QA"].arun(documents=[qa_doc])
await pipelines["Klubista"].arun(documents=[klubista_txt_doc])
await pipelines["PrevadzkovyPoriadok"].arun(documents=[prevadzkovy_txt_doc])

for doc in os.listdir("data/other_docs"):
    print(doc)
    txt = docx2txt.process(f"data/other_docs/{doc}")
    txt_doc = Document(text=txt)
    await pipelines["OtherDocs"].arun(documents=[txt_doc])


self.docstore.persist('docstore_db5.json')
They have different doc ids... I mean in the JSON, "131dfe02-3ea9-45a0-ad03-cb3b3a07b1cf": {

and "2d075cb4-25e5-4405-bb18-60f1193c7bee": {
Why though?
Ah, maybe because I am processing the text and then using the pipeline instead of loading from the doc directly
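That would explain the ids above: Document(text=...) assigns a fresh random UUID when no id is given, so identical text still looks new on every run. A quick check:

Python
from llama_index.core import Document

# Two Documents built from identical text still get different ids,
# because Document assigns a fresh random UUID when none is given.
d1 = Document(text="same text")
d2 = Document(text="same text")
print(d1.doc_id == d2.doc_id)  # False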
Unfortunately that didn't help.
Well, there is a bug, or otherwise I am completely lost.
It loads the persist dir, but the pipeline goes through anyway.
Okay, I got it... I didn't include filename_as_id=True ;-)... I will maybe update the IngestionPipeline docs because that's misleading
The docs I linked specifically emphasize needing a consistent doc ID πŸ‘€ They also use that argument
But glad you got it!
Sorry, then I am overworked :)
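For anyone reading later, the resolution as a short sketch (filename_as_id=True makes SimpleDirectoryReader derive a stable doc id from the file path, so reruns hash-match; pipelines is the dict from the code above):

Python
from llama_index.core import SimpleDirectoryReader

# A stable id derived from the file path lets the docstore recognize
# the same file on every run and skip re-ingesting it.
documents = SimpleDirectoryReader(
    input_files=["data/qa.docx"],
    filename_as_id=True,
).load_data()

await pipelines["QA"].arun(documents=documents)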