Hi guys, wondering one thing.

I have these pipelines; however, they insert new data every time they run (so after 3 runs with retrieval top_k = 3, all three retrieved chunks are the same text)...

Why?

pipelines = {
    "QA": IngestionPipeline(
        transformations=[
            SentenceSplitter(paragraph_separator="\n\n\n", chunk_size=300, chunk_overlap=20),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "Klubista": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "PrevadzkovyPoriadok": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "OtherDocs": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
}
I'm not sure what you mean?

If you run a document through a pipeline, it will create nodes πŸ‘€
So how do I avoid that?
I thought it wouldn't do it if I already have the same nodes in the vector database
You should attach a docstore to the pipeline if you want to track upserts/duplicates
Would you mind showing me how?
I thought the vector store was enough
I don't have a docstore, just a vector store?
The vector store is not enough. It's not tracking which documents have been inserted + their hashes
you need a docstore for this
in addition to a vector store
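Roughly like this, as a minimal sketch (DocstoreStrategy comes from llama_index.core.ingestion; if I remember right, UPSERTS is the default once both stores are attached, and SimpleVectorStore here just stands in for whatever vector store you actually use):

Python
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=400, chunk_overlap=50),
        OpenAIEmbedding(model="text-embedding-3-large"),
    ],
    docstore=SimpleDocumentStore(),  # records doc ids + content hashes of ingested docs
    vector_store=SimpleVectorStore(),  # swap in your real vector store
    docstore_strategy=DocstoreStrategy.UPSERTS,  # skip unchanged docs, re-run changed ones
)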
So I have to persist both the vector store and the document store?
Probably, yes
self.docstore.persist('docstore_db5')

if os.path.exists('docstore_db5'):
    self.docstore = SimpleDocumentStore.from_persist_path('docstore_db5')
else:
    self.docstore = SimpleDocumentStore()

This does not work :/
works fine for me?

Plain Text
>>> from llama_index.core.storage.docstore import SimpleDocumentStore
>>> docstore = SimpleDocumentStore()
>>> docstore.persist("./docstore_db5.json")
>>> docstore = SimpleDocumentStore.from_persist_path("./docstore_db5.json")
Well, the docstore works, but the ingestion pipeline still inserts new docs (the same ones).
It's dependent on the documents having the same doc id
Is that true for you?
How can I set the doc id? I don't change anything, I just run the same program twice.
The data is the same.
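One way, as a sketch (assuming you keep building Documents from raw text as in the code below): set id_ on each Document to something stable, e.g. the source path, so the hash check has a fixed id to compare against:

Python
from llama_index.core import Document

qa_text = "...contents of data/qa.docx..."  # however you load it
qa_doc = Document(text=qa_text)
qa_doc.id_ = "data/qa.docx"  # any string that stays the same across runs works as the id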
This is my code (apart from the part you sent):


pipelines = {
    "QA": IngestionPipeline(
        transformations=[
            SentenceSplitter(paragraph_separator="\n\n\n", chunk_size=300, chunk_overlap=20),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        docstore=self.docstore,
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "Klubista": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        docstore=self.docstore,
        vector_store=self.vector_store,
        cache=IngestionCache(),
    ),
    "PrevadzkovyPoriadok": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        docstore=self.docstore,
        cache=IngestionCache(),
    ),
    "OtherDocs": IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=400, chunk_overlap=50),
            TitleExtractor(),
            OpenAIEmbedding(model="text-embedding-3-large"),
        ],
        vector_store=self.vector_store,
        docstore=self.docstore,
        cache=IngestionCache(),
    ),
}

qa_text = docx2txt.process("data/qa.docx")
prevadzkovy_txt = Path("data/prevadzkovy.txt").read_text()
klubista_txt = Path("data/klubista.txt").read_text()
qa_doc = Document(text=qa_text)
prevadzkovy_txt_doc = Document(text=prevadzkovy_txt)
klubista_txt_doc = Document(text=klubista_txt)

await pipelines["QA"].arun(documents=[qa_doc])
await pipelines["Klubista"].arun(documents=[klubista_txt_doc])
await pipelines["PrevadzkovyPoriadok"].arun(documents=[prevadzkovy_txt_doc])

for doc in os.listdir("data/other_docs"):
    print(doc)
    txt = docx2txt.process(f"data/other_docs/{doc}")
    txt_doc = Document(text=txt)
    await pipelines["OtherDocs"].arun(documents=[txt_doc])


self.docstore.persist('docstore_db5.json')
They have different doc ids... I mean in the JSON, "131dfe02-3ea9-45a0-ad03-cb3b3a07b1cf": {

and "2d075cb4-25e5-4405-bb18-60f1193c7bee": {
Why though?
Ah, maybe because I am processing the text and then using the pipeline instead of loading from the doc directly
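That would explain the ids above: Document(text=...) assigns a fresh random UUID when no id is given, so identical text still looks new on every run. A quick check:

Python
from llama_index.core import Document

# Two Documents built from identical text still get different ids,
# because Document assigns a fresh random UUID when none is given.
d1 = Document(text="same text")
d2 = Document(text="same text")
print(d1.doc_id == d2.doc_id)  # False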
Unfortunately that didn't help.
Well, there is a bug, or otherwise I am completely lost.
It loads the persist dir, but the pipeline goes through anyway.
Okay, I got it... I didn't include filename_as_id=True ;-)... I will maybe update the IngestionPipeline docs because that's misleading
The docs I linked specifically emphasize needing a consistent doc ID πŸ‘€ They also use that argument
But glad you got it!
Sorry, then I am overworked :)
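For anyone reading later, the resolution as a short sketch (filename_as_id=True makes SimpleDirectoryReader derive a stable doc id from the file path, so reruns hash-match; pipelines is the dict from the code above):

Python
from llama_index.core import SimpleDirectoryReader

# A stable id derived from the file path lets the docstore recognize
# the same file on every run and skip re-ingesting it.
documents = SimpleDirectoryReader(
    input_files=["data/qa.docx"],
    filename_as_id=True,
).load_data()

await pipelines["QA"].arun(documents=documents)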