Hi!

I’ve used the great work the LlamaIndex and Neo4j teams have done using LLMs to extract nodes and entities from documents (see post https://www.llamaindex.ai/blog/customizing-property-graph-index-in-llamaindex)

Is there a way to have the LLM extract ‘node properties’ and ‘relationship properties’ as well?

If not, happy to contribute to this effort, just point me there.

-Michael
22 comments
Technically, right now the properties for KG nodes and relationships are inherited from the source chunk -- whatever metadata is on the source chunk gets used as properties

Happy to have a contribution for something more specific. Just need to be wary of throughput/LLM calls -- I feel like most implementations will be too slow to be useful
That sounds good, but in practice I didn't see in https://github.com/tomasonjo/blogs/blob/master/llm/llama_index_neo4j_custom_retriever.ipynb a) a way to define node or relationship properties
or
b) the LLM doing it when the strict parameter was set to False in the SchemaLLMPathExtractor

Guess I'll dig into the source code, learn, and build it
Yea, it's not implemented lol, it would have to be a custom extractor.

For now, the properties are just inherited from the source chunk metadata
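A minimal sketch of that inheritance behavior -- names and shapes here are illustrative stand-ins, not the actual llama_index internals:

```python
# Sketch of how source-chunk metadata can end up as KG properties.
# build_triplets and the triplet dict shape are hypothetical, chosen only
# to illustrate "properties are inherited from the source chunk".

def build_triplets(chunk_text, chunk_metadata, extracted):
    """Attach the chunk's metadata to every extracted node and relation."""
    triplets = []
    for subj, rel, obj in extracted:
        triplets.append({
            "subject": {"name": subj, "properties": dict(chunk_metadata)},
            "relation": {"type": rel, "properties": dict(chunk_metadata)},
            "object": {"name": obj, "properties": dict(chunk_metadata)},
        })
    return triplets

chunk_meta = {"file_name": "herbs.txt", "page": 3}
triplets = build_triplets(
    "Stinging nettle treats inflammation.",
    chunk_meta,
    [("stinging nettle", "TREATS", "inflammation")],
)
print(triplets[0]["subject"]["properties"])  # {'file_name': 'herbs.txt', 'page': 3}
```

Every node and relation from the same chunk ends up with identical properties, which is why chunk-level metadata alone can't express per-entity attributes.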
this is hard... but i am working on it..
I was also wondering how the properties are assigned. By reading the source code I saw that it inherits the metadata from the chunks. Indeed it may be helpful for a KG node to have some meaningful properties, but this can be time- and money-costly, as it requires many calls to the LLM.
Yea that's the biggest barrier imo, very compute expensive
why can't we ask the LLM to extract nodes and their properties in a single call (an LLM with large context input/output)?
You can. It's just adding latency to an already slow system lol
by what metric would something be considered fast enough?

I got a custom extractor working, but it's only doing 2 node properties at the moment 🚀 they're hard-coded into the class. I'm working to make it more general so you can pass schemas
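A custom extractor along those lines might look roughly like the sketch below -- a pure-Python illustration with two hard-coded node properties, not the actual class from this thread (llm_extract is a stand-in for a real LLM call):

```python
# Hypothetical custom-extractor sketch: two node properties are hard-coded
# into the class instead of being driven by a user-supplied schema.

HARDCODED_NODE_PROPS = ("category", "salience")

def llm_extract(text):
    # Stand-in for an LLM call that returns entities with free-form fields.
    return [{"name": "stinging nettle", "category": "PLANT",
             "salience": "high", "mood": "spiky"}]

class HardcodedPropertyExtractor:
    """Keeps only the whitelisted property keys on each extracted node."""

    def __call__(self, text):
        nodes = []
        for ent in llm_extract(text):
            props = {k: ent[k] for k in HARDCODED_NODE_PROPS if k in ent}
            nodes.append({"name": ent["name"], "properties": props})
        return nodes

nodes = HardcodedPropertyExtractor()("some chunk text")
print(nodes[0]["properties"])  # {'category': 'PLANT', 'salience': 'high'}
```

Generalizing this means replacing the hard-coded tuple with a schema the caller passes in, which is exactly the step discussed above.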
That is what I was thinking of. Since we already call the LLM to extract entities and relationships, why don't we ask the LLM to add properties as well? This would give each entity and relationship its own properties, instead of everything from the same document sharing the same ones, right? But ok, more tokens to generate means more cost (time and money)
right, but you still pay the per-call latency, i.e. waiting for each response, and that stays fixed as long as we don't add more calls. The input tokens (a more complex schema) and output tokens both increase, so processing time and cost go up there
@Logan M would you say the best approach is a custom extractor? or just passing a custom kg_schema_cls?

Still learning more about the codebase... can you confirm it must be a custom extractor?
I'm like 100% sure the existing schema extractor won't take any properties, even if you define them in a custom schema cls. It requires either a PR or a new extractor
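To make that concrete, a schema cls carrying properties might have the shape below (dataclasses stand in for the Pydantic models llama_index actually uses). The point is that defining these fields isn't enough by itself -- the extractor's prompt and output parser must also be taught to fill them:

```python
from dataclasses import dataclass, field

# Illustrative schema shapes only; the real llama_index classes differ.

@dataclass
class EntityNode:
    type: str
    name: str
    properties: dict = field(default_factory=dict)

@dataclass
class Relation:
    type: str
    properties: dict = field(default_factory=dict)

@dataclass
class Triplet:
    subject: EntityNode
    relation: Relation
    object: EntityNode

t = Triplet(
    subject=EntityNode("PLANT", "stinging nettle", {"source": "ethnobotany"}),
    relation=Relation("TREATS", {"evidence": "ethnopharmacological"}),
    object=EntityNode("DISEASE", "inflammation"),
)
print(t.relation.properties["evidence"])  # ethnopharmacological
```

An extractor that never asks the LLM for `properties` will simply leave these dicts empty, which matches the observation that a custom schema cls alone doesn't help.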
thanks Logan, will keep you posted
I see your commit from 2 days ago. I was working on it, you beat me to it! awesome job @Logan M

https://github.com/run-llama/llama_index/pull/14707
@Logan M just so I don't duplicate work again (although I learned a lot coding my own), did you make a notebook going through the new extraction features? Otherwise happy to, and I'll post it tomorrow.
I updated an existing notebook (you can see the PR) to kind of show it's there, but nothing dedicated
(sorry for going ahead on that, we needed it to replicate graphrag properly lol)
Dude, I'm not sore at all! Glad you did it. I learned a lot about the codebase, Pydantic, and powerful class abstractions that I didn't know before.
What notebook did you update? Got a link? I don't see the PR for llama_index. Just so I can make the dedicated notebook most helpful
Hey @Logan M

The notebook is at https://github.com/thekizoch/llama_index/blob/main/dev_notebooks/dev.ipynb

But I don't feel comfortable publishing it yet. The last part (building with strict=True, with properties) doesn't actually extract properties as is, and it rejects many valid triplets.

I debugged and there are strange validation errors. Here's a snippet:

> Triplet rejected: {'subject': {'type': 'PLANT', 'name': 'stinging nettle', 'properties': {'SOURCE': 'Ethnopharmacological knowledge and metabolites'}}, 'relation': {'type': 'TREATS', 'properties': {'EFFECT_STRENGTH': 'moderate', 'EVIDENCE': 'ethnopharmacological', 'DOSAGE': 'varies'}, 'object': {'type': 'DISEASE', 'name': 'health conditions', 'properties': {'SYNONYMS': 'various health issues'}}}}. Reason: [
>   {
>     "loc": [
>       "object"
>     ],
>     "msg": "field required",
>     "type": "value_error.missing"
>   }
> ]


You can see these errors for yourself by forking my repo, running the local forked clone of llama_index, and running https://github.com/thekizoch/llama_index/blob/main/dev_notebooks/test_new_strict.py

Let me know if you want me to PR, or otherwise move discussion to github/elsewhere

-Michael