Original message:
Hello everyone! Could someone please provide some clarification on IngestionPipeline? I am noticing that when I apply multiple transformations, the original document's ID is lost after the SentenceSplitter transformation, which ends up inserting new rows into the vector store, since the embedding's doc id is the doc id of the nodes produced by the MarkdownNodeParser transformation instead of the original document's.
Is this not the intended usage? My goal is to split the markdown sections into chunks after parsing, to break down long sections in my document, while preserving the original document's ID.
TIA!
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# pg_vector_store, docstore and documents are set up elsewhere
pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
        OpenAIEmbedding(),
    ],
    vector_store=pg_vector_store,
    docstore=docstore,
)
pipeline.run(documents=documents)
@Logan M @WhiteFang_Jr
This is a very good question regarding the use of IngestionPipeline; I'm interested in it too.
the original ID isn't lost, node.ref_doc_id refers to the parent document.
Just make sure that you have consistent document ids on the input documents, and your docstore will handle the logic for duplicates
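e.g. something along these lines (a sketch; the path-based id is just illustrative, and the imports assume the llama_index.core layout):

from llama_index.core import Document

# give each input document a stable, deterministic id (e.g. its file path)
# so reruns of the pipeline see the "same" document and can compare hashes
documents = [
    Document(text=open(path).read(), id_=path)
    for path in ["docs/guide.md"]
]

pipeline.run(documents=documents)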
My input document does have a fixed doc id. However, when I apply multiple transformations, I can see in the vector store table that the doc_id in the metadata is the random UUID generated from the first transformation instead of the input document's id.
So if the content changes and the pipeline is rerun, the vector store is not able to delete the old embedding because it can't find the input document's id.
The docstore should be stopping those documents from running in the first place if they are already recorded in the docstore though
unless you aren't using the same docstore between runs
If I change the original document, the pipeline detects the hash changed and correctly updates the docstore, but fails to find the input document's ref_doc_id in the vector store when it tries to delete it.
In my example, it works as expected if I remove the SentenceSplitter.
what vector store are you using?
hmmm, that should be working. vector_store.delete(ref_doc_id)
should be deleting all nodes associated with that ref doc id
at least, judging by the source code, it's doing that
and that's what gets called on upserts
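i.e. something like this, with the input document's id (a sketch; the id value is illustrative):

# called during an upsert; removes every row whose ref_doc_id matches
pg_vector_store.delete(ref_doc_id="123")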
seems like an issue with the docstore? Idk man
Hopefully the below example helps clear up the behaviour I am seeing.
Document(id=123)  <--- input doc to the pipeline
|
MarkdownNodeParser
|
Node(id=<uuid from MarkdownNodeParser>, ref_doc_id=123)
|
SentenceSplitter
|
Node(id=<uuid from SentenceSplitter>, ref_doc_id=<uuid from MarkdownNodeParser>)
|
Embedding saved with the metadata column's doc_id = <uuid from MarkdownNodeParser> instead of Document(id=123)
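Here's roughly how I can reproduce that chain outside the pipeline (a sketch; the text and id are made up, and in my run the second ref_doc_id comes back as the markdown node's uuid rather than 123):

from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter

doc = Document(text="# Section\n\nSome fairly long markdown content...", id_="123")

md_nodes = MarkdownNodeParser().get_nodes_from_documents([doc])
print(md_nodes[0].ref_doc_id)   # 123 -> still the input document

chunks = SentenceSplitter(chunk_size=200, chunk_overlap=0)(md_nodes)
print(chunks[0].ref_doc_id)     # the markdown node's uuid, not 123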
And if I remove the SentenceSplitter, it works as expected.
However, some of my markdown sections are a bit long, so I would like to break them down with SentenceSplitter.
Yea, I'm just saying if the docstore has already seen and recorded Document(id=123), and the hash is the same, it will be skipped (or if the hash is different, it will upsert)
The docstore controls this logic, so I think that's where the issue is
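fwiw, that skip/upsert behaviour is keyed off the pipeline's docstore_strategy, something like this (a sketch; I believe UPSERTS is already the default once a vector store is attached):

from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
        OpenAIEmbedding(),
    ],
    vector_store=pg_vector_store,
    docstore=docstore,
    # skip docs whose hash is unchanged, re-run and replace when it changes
    docstore_strategy=DocstoreStrategy.UPSERTS,
)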
Yea, and it correctly upserts the docstore when the hash of the original document changes, but from what I observed, the vector store is unable to delete the embedding since the input document's id is not in the embedding's metadata: the ref doc id changed after the SentenceSplitter created new nodes from the MarkdownNodeParser output nodes (which are treated as the ref docs).
For the moment, I have subclassed MarkdownNodeParser and added the split logic to it to get around the issue.
However, I think it would be ideal to be able to preserve the original input doc's ID as the source node for the intermediary nodes created in the ingestion pipeline.
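In case it helps anyone, here's a rough sketch of that workaround. The class name and the choice of _parse_nodes as the override point are my own; the idea is just to split inside the parser and then point each chunk's source back at the original document:

from typing import Any, List, Sequence

from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
from llama_index.core.schema import BaseNode, NodeRelationship

class SplittingMarkdownNodeParser(MarkdownNodeParser):
    """Parse markdown sections, chunk long ones, and keep the original
    document as every chunk's source node."""

    def _parse_nodes(
        self, nodes: Sequence[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        md_nodes = super()._parse_nodes(nodes, show_progress=show_progress, **kwargs)
        chunks = SentenceSplitter(chunk_size=200, chunk_overlap=0)(md_nodes)
        md_by_id = {n.node_id: n for n in md_nodes}
        for chunk in chunks:
            # the splitter points chunks at the markdown node; re-point them
            # at that node's own source (the original input document)
            parent = md_by_id.get(chunk.ref_doc_id)
            if parent is not None and parent.source_node is not None:
                chunk.relationships[NodeRelationship.SOURCE] = parent.source_node
        return chunks

With that, the pipeline's transformations reduce to [SplittingMarkdownNodeParser(), OpenAIEmbedding()], and in my case the embeddings are stored against the original document's id again.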