Has anyone thought about modifying the property graph code to do semantic chunking and triples extraction at the same time?
My thought process:
- Parse the document into chunks
- Recombine all the chunks into one text string, inserting metadata that marks where each chunk starts and ends
So basically, original text: "paragraph 1, paragraph 2, paragraph 3" => [{Chunk1 text: "paragraph 1"}, {Chunk2 text: "paragraph 2"}, {Chunk3 text: "paragraph 3"}] =>
"""
Chunk1
=========
paragraph 1
=========
Chunk2
=========
paragraph 2
=========
Chunk3
=========
paragraph 3
=========
"""
- When you are doing extraction, you pass the entire document, and instead of pulling out triples by themselves you prompt the model to pull out each triple plus the ids of the chunks it came from
Entity("Paragraph") -- Relationship("HAS_NUMBER") --> Entity("1") extracted from chunk1
I would say allow for / ask for between 1 and MAX_CHUNKS_PER_TRIPLE chunk ids to be referenced per triple when performing extraction. Now you know exactly which chunk(s) every single triple came from, plus the triples are created with the context of the whole document.
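
Here is a rough sketch of what I mean, in plain Python. Everything named here (`combine_chunks`, `parse_triple`, the `Triple` class, the dict shape the model returns, and the value of `MAX_CHUNKS_PER_TRIPLE`) is made up for illustration and is not part of the existing property graph code. The actual LLM call is left out; the sketch only covers building the combined string and checking the chunk ids that come back.

```python
from dataclasses import dataclass

SEPARATOR = "========="
MAX_CHUNKS_PER_TRIPLE = 3  # hypothetical limit, pick whatever suits your data


def combine_chunks(chunks: list[str]) -> str:
    """Join chunks into one string, labelling each one with a chunk id."""
    parts = []
    for i, text in enumerate(chunks, start=1):
        parts.extend([f"Chunk{i}", SEPARATOR, text, SEPARATOR])
    return "\n".join(parts)


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    chunk_ids: tuple[str, ...]  # chunks the model says the triple came from


def parse_triple(raw: dict, valid_ids: set[str]) -> Triple:
    """Validate one triple in an assumed JSON-like shape returned by the model."""
    ids = tuple(raw["chunk_ids"])
    if not 1 <= len(ids) <= MAX_CHUNKS_PER_TRIPLE:
        raise ValueError(f"expected 1 to {MAX_CHUNKS_PER_TRIPLE} chunk ids, got {len(ids)}")
    unknown = [c for c in ids if c not in valid_ids]
    if unknown:
        raise ValueError(f"model referenced unknown chunk ids: {unknown}")
    return Triple(raw["subject"], raw["relation"], raw["object"], ids)


chunks = ["paragraph 1", "paragraph 2", "paragraph 3"]
combined = combine_chunks(chunks)  # this is the text you would put in the prompt
valid_ids = {f"Chunk{i}" for i in range(1, len(chunks) + 1)}

# Pretend this dict is what the model returned for one triple.
triple = parse_triple(
    {"subject": "Paragraph", "relation": "HAS_NUMBER", "object": "1", "chunk_ids": ["Chunk1"]},
    valid_ids,
)
```

Rejecting chunk ids that are not in the document is a cheap guard against the model inventing a source, which matters if the ids are later used to link triples back to chunk nodes.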