Original message:
Hello everyone! Could someone please provide some clarification on IngestionPipeline? I am noticing that when I apply multiple transformations, the original document's ID is lost after the SentenceSplitter transformation, which ends up inserting new rows into the vector store, since the embedding's doc id is the doc id of the nodes produced by the MarkdownNodeParser transformation instead of the original document's.
Is this not the intended usage? My goal is to split the markdown sections into chunks after parsing, to break down long sections in my document, while preserving the original document's ID.
TIA!
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# pg_vector_store, docstore and documents are set up elsewhere
pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
        OpenAIEmbedding(),
    ],
    vector_store=pg_vector_store,
    docstore=docstore,
)
pipeline.run(documents=documents)
@Logan M @WhiteFang_Jr
This is a very good question regarding the use of IngestionPipeline; I'm interested in it too.
the original ID isn't lost, node.ref_doc_id refers to the parent document.
Just make sure that you have consistent document ids on the input documents, and your docstore will handle the logic for duplicates
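e.g. something along these lines (a sketch; the path-based id is just illustrative, and the imports assume the llama_index.core layout):

from llama_index.core import Document

# give each input document a stable, deterministic id (e.g. its file path)
# so reruns of the pipeline see the "same" document and can compare hashes
documents = [
    Document(text=open(path).read(), id_=path)
    for path in ["docs/guide.md"]
]

pipeline.run(documents=documents)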
My input document does have a fixed doc id. However, when I apply multiple transformations, I can see in the vector store table that the doc_id in the metadata is the random UUID generated from the first transformation instead of the input document's id.
So if the content changes and the pipeline is rerun, the vector store is not able to delete the old embedding because it can't find the input document's id.
The docstore should be stopping those documents from running in the first place if they are already recorded in the docstore though
unless you aren't using the same docstore between runs
If I change the original document, the pipeline detects the hash changed and correctly updates the docstore, but fails to find the input document's ref_doc_id in the vector store when it tries to delete it.
In my example, it works as expected if I remove the SentenceSplitter.
what vector store are you using?
hmmm, that should be working. vector_store.delete(ref_doc_id)
should be deleting all nodes associated with that ref doc id
at least, judging by the source code, it's doing that
and that's what gets called on upserts
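i.e. something like this, with the input document's id (a sketch; the id value is illustrative):

# called during an upsert; removes every row whose ref_doc_id matches
pg_vector_store.delete(ref_doc_id="123")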
seems like an issue with the docstore? Idk man
Hopefully the below example helps clear up the behaviour I am seeing.
Document(id=123)  <--- input doc to the pipeline
|
MarkdownNodeParser
|
Node(id=<uuid from MarkdownNodeParser>, ref_doc_id=123)
|
SentenceSplitter
|
Node(id=<uuid from SentenceSplitter>, ref_doc_id=<uuid from MarkdownNodeParser>)
|
Embedding saved with the metadata column's doc_id = <uuid from MarkdownNodeParser> instead of Document(id=123)
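Here's roughly how I can reproduce that chain outside the pipeline (a sketch; the text and id are made up, and in my run the second ref_doc_id comes back as the markdown node's uuid rather than 123):

from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter

doc = Document(text="# Section\n\nSome fairly long markdown content...", id_="123")

md_nodes = MarkdownNodeParser().get_nodes_from_documents([doc])
print(md_nodes[0].ref_doc_id)   # 123 -> still the input document

chunks = SentenceSplitter(chunk_size=200, chunk_overlap=0)(md_nodes)
print(chunks[0].ref_doc_id)     # the markdown node's uuid, not 123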
And if I remove the SentenceSplitter, it works as expected.
However, some of my markdown sections are a bit long, so I would like to break them down with SentenceSplitter.
Yea, I'm just saying if the docstore has already seen and recorded Document(id=123), and the hash is the same, it will be skipped (or if the hash is different, it will upsert)
The docstore controls this logic, so I think that's where the issue is
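fwiw, that skip/upsert behaviour is keyed off the pipeline's docstore_strategy, something like this (a sketch; I believe UPSERTS is already the default once a vector store is attached):

from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
        OpenAIEmbedding(),
    ],
    vector_store=pg_vector_store,
    docstore=docstore,
    # skip docs whose hash is unchanged, re-run and replace when it changes
    docstore_strategy=DocstoreStrategy.UPSERTS,
)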
Yea, and it correctly upserts the docstore when the hash of the original document changes, but from what I observed, the vector store is unable to delete the embedding since the input document's id is not in the embedding's metadata: the ref doc id changed after the SentenceSplitter created new nodes from the MarkdownNodeParser output nodes (which are treated as the ref docs).
For the moment, I have subclassed MarkdownNodeParser and added the split logic to it to get around the issue.
However, I think it would be ideal to be able to preserve the original input doc's ID as the source node for the intermediary nodes created in the ingestion pipeline.
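In case it helps anyone, here's a rough sketch of that workaround. The class name and the choice of _parse_nodes as the override point are my own; the idea is just to split inside the parser and then point each chunk's source back at the original document:

from typing import Any, List, Sequence

from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
from llama_index.core.schema import BaseNode, NodeRelationship

class SplittingMarkdownNodeParser(MarkdownNodeParser):
    """Parse markdown sections, chunk long ones, and keep the original
    document as every chunk's source node."""

    def _parse_nodes(
        self, nodes: Sequence[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        md_nodes = super()._parse_nodes(nodes, show_progress=show_progress, **kwargs)
        chunks = SentenceSplitter(chunk_size=200, chunk_overlap=0)(md_nodes)
        md_by_id = {n.node_id: n for n in md_nodes}
        for chunk in chunks:
            # the splitter points chunks at the markdown node; re-point them
            # at that node's own source (the original input document)
            parent = md_by_id.get(chunk.ref_doc_id)
            if parent is not None and parent.source_node is not None:
                chunk.relationships[NodeRelationship.SOURCE] = parent.source_node
        return chunks

With that, the pipeline's transformations reduce to [SplittingMarkdownNodeParser(), OpenAIEmbedding()], and in my case the embeddings are stored against the original document's id again.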