Has anyone thought about modifying the property graph code to do semantic chunking and triples extraction at the same time?

My thought process:
  • Parse the document into chunks
  • Recombine all the chunks into one text string, but insert metadata into the combined string marking where each chunk begins and ends
So basically, original text: "paragraph 1, paragraph 2, paragraph 3" => [{Chunk1 text: "paragraph 1"}, {Chunk2 text: "paragraph 2"}, {Chunk3 text: "paragraph 3"}] =>

"""
Chunk1
=========
paragraph 1
=========

Chunk2
=========
paragraph 2
=========

Chunk3
=========
paragraph 3
=========
"""

  • when you are doing extraction, you pass the entire document, and instead of pulling out triples by themselves you prompt the model to pull out each triple plus the chunk ids it came from
Entity("Paragraph") -- Relationship("HAS_NUMBER") --> Entity("1") extracted from chunk1


I would say allow for / ask for 1 to MAX_CHUNKS_PER_TRIPLE chunks to be referenced via chunk ids when performing extraction. Now you get the exact location of every single triple, plus you get triples created with the context of the whole document.
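A rough sketch of the combine step (the Chunk<N> delimiter format and the combine_chunks helper are just my own convention, nothing built in):

from llama_index.core.schema import TextNode

def combine_chunks(chunks: list[TextNode]) -> str:
    """Join chunks into one string, labeling each with a 1-based chunk id."""
    parts = [
        f"Chunk{i}\n=========\n{chunk.text}\n=========\n"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n".join(parts)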
18 comments
Hmm, I'm not sure if I fully follow.

Basically putting more than one chunk into the input of the LLM? Doesn't this imply that all chunks can fit into a single LLM call?
Usefulness definitely increases as models get longer context windows and get smarter, but if the document gets too long you can still split it at any chunk boundary.
@Logan M Any insight into the path for implementing this?
I basically wanna maximize the amount of text passed to the LLM for entity extraction, so the entities are extracted with the most context about the document, but I also want chunks to stay small and easily queryable using vector embeddings
I mean, it could just be a custom extractor
Basically your extractor would just wrap an existing extractor, but do the step to combine (and then un-combine?) text chunks
I'd also need to create a new entity extraction prompt.
ah yea true, need to map to the chunk number
and modify the upload to the pg graph as well
I don't think the upload needs to be modified
all you need is your list of entities and relations no? (and your original text chunks)
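A rough, untested sketch of what that could look like (extract_triples_with_chunk_ids is a hypothetical stand-in for your own whole-document LLM call plus output parsing, and combine_chunks is the helper sketched earlier):

from llama_index.core.graph_stores.types import (
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    EntityNode,
    Relation,
)
from llama_index.core.schema import BaseNode, TransformComponent

class WholeDocPathExtractor(TransformComponent):
    """Hypothetical extractor: one LLM pass over the combined document,
    then each triple is mapped back to its chunks via the returned chunk ids."""

    def __call__(self, nodes: list[BaseNode], **kwargs) -> list[BaseNode]:
        combined = combine_chunks(nodes)
        # extract_triples_with_chunk_ids (hypothetical) should yield
        # (subject, relation, object, chunk_ids) tuples from one LLM call.
        for subj, rel, obj, chunk_ids in extract_triples_with_chunk_ids(combined):
            for i in chunk_ids:
                node = nodes[i - 1]  # chunk ids are 1-based in the combined text
                subj_node, obj_node = EntityNode(name=subj), EntityNode(name=obj)
                entities = node.metadata.get(KG_NODES_KEY, [])
                relations = node.metadata.get(KG_RELATIONS_KEY, [])
                entities.extend([subj_node, obj_node])
                relations.append(
                    Relation(label=rel, source_id=subj_node.id, target_id=obj_node.id)
                )
                node.metadata[KG_NODES_KEY] = entities
                node.metadata[KG_RELATIONS_KEY] = relations
        return nodes

That way the upload really does stay untouched: the index picks the entities and relations up from those metadata keys when inserting into the graph store.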
@Logan M In relation to the questions above, where is the best place to modify the KG extractor prompt and pass that in?
Depends on the extractor I suppose, which one are you using?
@Logan M I think I want to make a custom extractor that does the semantic parsing as well, like I talked about above

kg_extractor = SchemaLLMPathExtractor(
    llm=kg_extraction_llm,
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=validation_schema,
    strict=True,
    max_triplets_per_chunk=10,
)

index = PropertyGraphIndex.from_existing(
    property_graph_store=pg_store,
    llm=kg_extraction_llm,
    kg_extractors=[kg_extractor],
    embed_model=embed_model,
    embed_kg_nodes=True,
    show_progress=False,
)
Right right. So with SchemaLLMPathExtractor

DEFAULT_SCHEMA_PATH_EXTRACT_PROMPT = (
    "Given the following text, extract the knowledge graph according to the provided schema. "
    "Try to limit the output to {max_triplets_per_chunk} extracted paths.\n"
    "-------\n"
    "{text}\n"
    "-------\n"
)

kg_extractor = SchemaLLMPathExtractor(..., extract_prompt=DEFAULT_SCHEMA_PATH_EXTRACT_PROMPT)
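For the whole-document variant you could swap in a chunk-aware prompt along these lines (my wording, not a built-in default; note SchemaLLMPathExtractor only fills in {text} and {max_triplets_per_chunk}, so the chunk-id budget is hardcoded here):

# Assumed wording, not a LlamaIndex default.
CHUNK_AWARE_EXTRACT_PROMPT = (
    "Given the following text, extract the knowledge graph according to the "
    "provided schema. The text is divided into labeled chunks (Chunk1, Chunk2, ...). "
    "For each extracted path, also list the ids of the chunks it was extracted "
    "from (at most 3 chunk ids per path). "
    "Try to limit the output to {max_triplets_per_chunk} extracted paths.\n"
    "-------\n"
    "{text}\n"
    "-------\n"
)

kg_extractor = SchemaLLMPathExtractor(..., extract_prompt=CHUNK_AWARE_EXTRACT_PROMPT)

In practice the structured output would also need a chunk_ids field per path, which the default parsing doesn't have, so this prompt really belongs inside the custom extractor's own LLM call.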