Updated 2 months ago

Hello everyone! Could someone please

Hello everyone! Could someone please clarify how IngestionPipeline is meant to be used? I am noticing that when I apply multiple transformations, the original document's ID is lost after the SentenceSplitter transformation. This ends up inserting new rows into the vector store, because the embedding's doc ID is the ID of the nodes from the MarkdownNodeParser transformation instead of the original document's ID.

Is this not the intended usage? My goal is to split the markdown sections into chunks after parsing, so that long sections in my document are broken down while the original document's ID is preserved.

TIA!

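# Import paths assume a recent llama-index release (the `llama_index.core` layout);
# adjust them if you are on an older version.
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# pg_vector_store, docstore and documents are created elsewhere (not shown).
pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(),  # split each document into markdown sections
        SentenceSplitter(chunk_size=200, chunk_overlap=0),  # break long sections into chunks
        OpenAIEmbedding(),  # embed each chunk
    ],
    vector_store=pg_vector_store,
    docstore=docstore,
)
pipeline.run(documents=documents)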
1 comment
The original ID isn't lost: node.ref_doc_id refers to the parent document.

Just make sure the input documents have consistent document IDs, and your docstore will handle the logic for duplicates.
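
To make the idea concrete, here is a minimal pure-Python sketch of why consistent document IDs matter. It is a toy model and not LlamaIndex's actual implementation: `Chunk`, `split_document`, `ToyDocstore` and `ingest` are hypothetical names made up for illustration, and the splitting is deliberately crude. It assumes the docstore keeps one content hash per document ID and that every chunk carries the original document's ID as its `ref_doc_id`.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    ref_doc_id: str  # ID of the original input document, not of an intermediate node


def split_document(doc_id: str, text: str, chunk_size: int = 200) -> list[Chunk]:
    """Toy two-stage split: sections by blank line, then fixed-size pieces."""
    chunks = []
    for section in text.split("\n\n"):
        for start in range(0, len(section), chunk_size):
            piece = section[start:start + chunk_size]
            chunks.append(Chunk(piece, ref_doc_id=doc_id))
    return chunks


class ToyDocstore:
    """Hypothetical stand-in for a docstore: one content hash per document ID."""

    def __init__(self) -> None:
        self.hashes: dict[str, str] = {}

    def needs_ingest(self, doc_id: str, text: str) -> bool:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == digest:
            return False  # same ID and same content: nothing to do
        self.hashes[doc_id] = digest  # new or changed document: (re)ingest it
        return True


def ingest(store: ToyDocstore, rows: dict[str, list[Chunk]], doc_id: str, text: str) -> bool:
    """Return True if the document was (re)ingested, False if it was skipped."""
    if not store.needs_ingest(doc_id, text):
        return False
    rows[doc_id] = split_document(doc_id, text)  # replaces old chunks for this ID
    return True
```

With a stable ID such as a file path, running `ingest` twice on unchanged text skips the second run, and running it on changed text replaces the old chunks instead of adding more. If the ID changes on every run (for example a fresh random UUID), the store cannot tell that it has seen the document before, which would look like the duplicate rows described in the question.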