Is this coming from the metadata extractors?
Or did you add this metadata yourself?
It's a combination of both
from llama_index import Document

def create_document(text, metadata):
    document = Document(
        text=" ".join(text),
        metadata=metadata,
        metadata_separator="::",
        metadata_template="{key}=>{value}",
        text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
    )
    return document
Here I create my own documents, providing a dictionary of metadata
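For illustration, a call might look like this (hypothetical values; the keys mirror the metadata that shows up in the output further down):

# hypothetical example call; text fragments get joined with spaces
doc = create_document(
    text=["Artikel 2", "..."],
    metadata={
        "Hoofdstuk": "Hoofdstuk II. Heffing ter zake van leveringen en diensten",
        "Artikel": "Artikel 2",
        "Afdeling": "Afdeling 1. Belastbaar feit",
    },
)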
# imports as of the llama_index 0.7.x releases (paths may differ in other versions)
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor, TitleExtractor, KeywordExtractor, SummaryExtractor,
)
from llama_index.node_parser.extractors.metadata_extractors import (
    DEFAULT_TITLE_NODE_TEMPLATE, DEFAULT_TITLE_COMBINE_TEMPLATE, DEFAULT_SUMMARY_EXTRACT_TEMPLATE,
)

metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=3, node_template=DEFAULT_TITLE_NODE_TEMPLATE, combine_template=DEFAULT_TITLE_COMBINE_TEMPLATE),
        KeywordExtractor(keywords=2),
        SummaryExtractor(summaries=["prev", "self"], prompt_template=DEFAULT_SUMMARY_EXTRACT_TEMPLATE),
    ],
)

node_parser = SimpleNodeParser(
    metadata_extractor=metadata_extractor
)

tax_nodes = node_parser.get_nodes_from_documents(documents)
I then use this piece of code
Which I guess turns the docs into nodes; it seems like the metadata is still present after this
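To sanity-check that, something like this (illustrative) confirms the source metadata survived parsing:

# the metadata set on the Document should carry over to the parsed nodes
print(tax_nodes[0].metadata)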
hmm I wonder if the extractors are messing around with the metadata formatting? Taking a peek at the code
This is one of the nodes from the code above
Old metadata seems intact
Which explains the excerpt thing
Seems easy to disable at least
MetadataExtractor(extractors=[...], disable_template_rewrite=True)
Let me try that real quick
from llama_index.schema import MetadataMode
test_node = tax_nodes[2]
print("The LLM sees this: \n", test_node.get_content(metadata_mode=MetadataMode.LLM))
The LLM sees this:
Metadata: Hoofdstuk: Hoofdstuk II. Heffing ter zake van leveringen en diensten
Artikel: Artikel 2
Afdeling: Afdeling 1. Belastbaar feit
Paragraaf:
document_title: Harmonisatie van de Europese Unie Wetgevingen inzake Belasting op Leveringen van Goederen en Diensten, Intracommunautaire Verwervingen en Invoer van Goederen: Wet op de Omzetbelasting 1968
excerpt_keywords: BTW-richtlijn 2006, accijnsgoederen
prev_section_summary:
<summary>
section_summary:
<summary>
-----
Content: Metadata:
-----
Content: <content>
What is going on 😵 One sec, going to run this myself lol
nah I got a test file already
hmmm I should have picked a smaller example LOL
restarting with a single small document
Takes approx. 20 minutes and $8 in OpenAI costs to parse my full text lol
I know we want to make some of this parallel, but then rate limits become an issue..
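If/when we do parallelize, the usual workaround is to cap the number of in-flight requests; a generic sketch of the pattern (not LlamaIndex code, just the idea):

import asyncio

async def throttled_gather(coros, max_concurrent=5):
    # cap concurrent requests so we stay under the provider's rate limit
    sem = asyncio.Semaphore(max_concurrent)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))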
I blame the summaries though, think that's the main driving force behind the cost and usage
I haven't hit as many rate limits as I did a few months ago with LlamaIndex though!
I feel like in March/April rate limits were a much bigger issue, even when just building plain indices with nothing special
hmmm, OK, I think I've run into the same issue you have with the template getting all messed up. Trying to figure out how that happens now
it seems like some of the metadata is ending up in the actual text 🤦‍♂️
(Pdb) nodes[0].text
'Metadata: \n-----\nContent: \nContext\nLLMs are....'
and all the template/separator stuff does not get inherited properly..
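A quick way to see the gap (illustrative; both Document and the parsed nodes carry a metadata_template attribute):

print(documents[0].metadata_template)  # '{key}=>{value}', as set on the Document
print(nodes[0].metadata_template)      # falls back to the library default instead of inheriting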
time to pdb.set_trace()
and step through the code lol
Hopefully it doesn't take too long
I'm gonna get some dinner rn, I'll be back later tonight
Best of luck, and thanks for the effort so far!
Once again very much appreciated
Will let you know what I find lol, and no worries!
Fixed the issue! There were a few compounding issues
- weird behaviour when getting the node content with metadata mode NONE and a customized template
- the metadata template was missed in the inheritance
- too many newlines in the extracted metadata
(Pdb) print(nodes[0].get_content(metadata_mode=MetadataMode.LLM))
Metadata: test=>val
entities=>{'LangChain', 'SQL', 'Flask', 'Docker'}
excerpt_keywords=>LLMs, LlamaIndex
section_summary=>LlamaIndex is a data framework that provides tools to help augment LLMs with private data. It offers data connectors to ingest data sources and formats, provides ways to structure data, an advanced retrieval/query interface, and easy integrations with outer application frameworks. It is suitable for both beginner and advanced users, with a high-level API for the former and lower-level APIs for the latter.
-----
Content:
LLMs are a phenomenal piece of technology for knowledge generation and reasoning.
They are pre-trained on large amounts of publicly available data.
How do we best augment LLMs with our own private data?
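For completeness, the NONE mode should now return just the raw text too; a quick check on the same nodes:

# with the fix, NONE should return only the raw content, no template wrapping
print(nodes[0].get_content(metadata_mode=MetadataMode.NONE))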
You're amazing! Thanks very much for the effort
Just updated my code w/ the new release and everything works as expected! It also seems that the cost of parsing nodes has drastically gone down?? Instead of several dollars to parse all Documents to Nodes and build an index, I now spent about $0.10?
Not sure if related to this bugfix but a nice addition nonetheless
Were you setting an LLM? If not, the default LLM actually changed to gpt-3.5 from text-davinci-003, which is muuuuch cheaper
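If you want to be immune to default swaps, you can pin the LLM explicitly; a rough sketch assuming the 0.7.x-era ServiceContext API:

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI

# pin the model so a library-default change can't silently swap it
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"))
index = VectorStoreIndex.from_documents(documents, service_context=service_context)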
But we actually just reverted the change in favour of doing a bigger 0.8.0 release lol
Let me know if you encounter any weirdness in the prompts though! We also updated the internal prompts to hopefully work better with gpt-3.5
Nope, I used the default LLM, so that explains it!
So far the prompts are good. I'm rewriting all the prompts in Dutch rn so the responses come back in Dutch natively, so I don't think I'd notice regardless
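Roughly what I mean (a sketch, assuming the Prompt API from this llama_index version; index is whatever index you built):

from llama_index import Prompt

# QA template in Dutch so answers come back in Dutch
# (translation: "Context information is below. ... Answer the question based on the context.")
dutch_qa_template = Prompt(
    "Contextinformatie staat hieronder.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Beantwoord de vraag op basis van de context: {query_str}\n"
)

query_engine = index.as_query_engine(text_qa_template=dutch_qa_template)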
However, I did encounter some weirdness where I asked the LLM about a specific Article that was clearly provided in the context, but it wasn't able to synthesize a response from it. This was before I used the native prompts
You encountered that weirdness when using custom prompts you mean? Or the weirdness happened with the new default prompts?
I think it was with the default prompts...? If I encounter the behaviour again I'll shoot a message. So far, so good!