Has anyone thought about modifying the property graph code to do semantic chunking and triples extraction at the same time?

My thought process:
  • Parse the document into chunks
  • Recombine all the chunks into one text string, but insert metadata into the combined string marking where each chunk begins and ends
So basically, original text: "paragraph 1, paragraph 2, paragraph 3" => [{Chunk1 text: "paragraph 1"}, {Chunk2 text: "paragraph 2"}, {Chunk3 text: "paragraph 3"}] =>

"""
Chunk1
=========
paragraph 1
=========

Chunk2
=========
paragraph 2
=========

Chunk3
=========
paragraph 3
=========
"""

  • when you are doing extraction, you pass the entire document, and instead of pulling out triples by themselves you prompt the model to pull out each triple plus the chunk ids it came from
Entity("Paragraph") -- Relationship("HAS_NUMBER") --> Entity("1") extracted from chunk1


I would say allow for / ask for 1 to MAX_CHUNKS_PER_TRIPLE chunks to be referenced via chunk ids when performing extraction. Now you get the exact location of every single triple, plus you get triples created with the context of the whole document.
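A rough sketch of the combine step (the Chunk<N> delimiter format and the combine_chunks helper are just my own convention, nothing built in):

from llama_index.core.schema import TextNode

def combine_chunks(chunks: list[TextNode]) -> str:
    """Join chunks into one string, labeling each with a 1-based chunk id."""
    parts = [
        f"Chunk{i}\n=========\n{chunk.text}\n=========\n"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n".join(parts)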
18 comments
Hmm, I'm not sure if I fully follow.

Basically putting more than one chunk into the input of the LLM? Doesn't this imply that all chunks can fit into a single LLM call?
Usefulness definitely increases as models get longer context windows and get smarter, but if the document gets too long you can still split it at any chunk boundary.
@Logan M Any insight into the path for implementing this?
I basically wanna maximize the amount of text passed to the LLM for entity extraction, so the entities are extracted with the most context about the document, but I also want chunks to stay small and easily queryable using vector embeddings
I mean, it could just be a custom extractor
Basically your extractor would just wrap an existing extractor, but do the step to combine (and then un-combine?) text chunks
I'd also need to create a new entity extraction prompt.
ah yea true, need to map to the chunk number
and modify the upload to the pg graph as well
I don't think the upload needs to be modified
all you need is your list of entities and relations no? (and your original text chunks)
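A rough, untested sketch of what that could look like (extract_triples_with_chunk_ids is a hypothetical stand-in for your own whole-document LLM call plus output parsing, and combine_chunks is the helper sketched earlier):

from llama_index.core.graph_stores.types import (
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    EntityNode,
    Relation,
)
from llama_index.core.schema import BaseNode, TransformComponent

class WholeDocPathExtractor(TransformComponent):
    """Hypothetical extractor: one LLM pass over the combined document,
    then each triple is mapped back to its chunks via the returned chunk ids."""

    def __call__(self, nodes: list[BaseNode], **kwargs) -> list[BaseNode]:
        combined = combine_chunks(nodes)
        # extract_triples_with_chunk_ids (hypothetical) should yield
        # (subject, relation, object, chunk_ids) tuples from one LLM call.
        for subj, rel, obj, chunk_ids in extract_triples_with_chunk_ids(combined):
            for i in chunk_ids:
                node = nodes[i - 1]  # chunk ids are 1-based in the combined text
                subj_node, obj_node = EntityNode(name=subj), EntityNode(name=obj)
                entities = node.metadata.get(KG_NODES_KEY, [])
                relations = node.metadata.get(KG_RELATIONS_KEY, [])
                entities.extend([subj_node, obj_node])
                relations.append(
                    Relation(label=rel, source_id=subj_node.id, target_id=obj_node.id)
                )
                node.metadata[KG_NODES_KEY] = entities
                node.metadata[KG_RELATIONS_KEY] = relations
        return nodes

That way the upload really does stay untouched: the index picks the entities and relations up from those metadata keys when inserting into the graph store.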
@Logan M In relation to the questions above, where is the best place to modify the KG extractor prompt and pass that in?
Depends on the extractor I suppose, which one are you using?
@Logan M I think I want to make a custom extractor that does the semantic parsing as well, like I talked about above

kg_extractor = SchemaLLMPathExtractor(
    llm=kg_extraction_llm,
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=validation_schema,
    strict=True,
    max_triplets_per_chunk=10,
)

index = PropertyGraphIndex.from_existing(
    property_graph_store=pg_store,
    llm=kg_extraction_llm,
    kg_extractors=[kg_extractor],
    embed_model=embed_model,
    embed_kg_nodes=True,
    show_progress=False,
)
Right right. So with SchemaLLMPathExtractor

DEFAULT_SCHEMA_PATH_EXTRACT_PROMPT = (
    "Given the following text, extract the knowledge graph according to the provided schema. "
    "Try to limit the output to {max_triplets_per_chunk} extracted paths.\n"
    "-------\n"
    "{text}\n"
    "-------\n"
)

kg_extractor = SchemaLLMPathExtractor(..., extract_prompt=DEFAULT_SCHEMA_PATH_EXTRACT_PROMPT)
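For the whole-document variant you could swap in a chunk-aware prompt along these lines (my wording, not a built-in default; note SchemaLLMPathExtractor only fills in {text} and {max_triplets_per_chunk}, so the chunk-id budget is hardcoded here):

# Assumed wording, not a LlamaIndex default.
CHUNK_AWARE_EXTRACT_PROMPT = (
    "Given the following text, extract the knowledge graph according to the "
    "provided schema. The text is divided into labeled chunks (Chunk1, Chunk2, ...). "
    "For each extracted path, also list the ids of the chunks it was extracted "
    "from (at most 3 chunk ids per path). "
    "Try to limit the output to {max_triplets_per_chunk} extracted paths.\n"
    "-------\n"
    "{text}\n"
    "-------\n"
)

kg_extractor = SchemaLLMPathExtractor(..., extract_prompt=CHUNK_AWARE_EXTRACT_PROMPT)

In practice the structured output would also need a chunk_ids field per path, which the default parsing doesn't have, so this prompt really belongs inside the custom extractor's own LLM call.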