
Hey I'm slightly confused about Documents and Nodes, and the way that my LLM 'sees' those nodes. I am creating a custom list of Documents, which I afterwards put into a NodeParser to call get_nodes_from_documents. Afterwards I used this code to check what my LLM is seeing.

Plain Text
from llama_index.schema import MetadataMode
document = tax_nodes[12] # Random sample from nodeparser
print("The LLM sees this: \n", document.get_content(metadata_mode=MetadataMode.LLM))


The output confuses me atm (shortened for convenience)
Plain Text
The LLM sees this: 
[Excerpt from document]
Chapter: chapter II.
Article: Article 12
Paragraph: Paragraph 1
document_title: <lorem ipsum>
prev_section_summary: <lorem ipsum>
Excerpt:
----
Metadata:
----
Content: <content>
----

I don't really understand how to interpret this. The top part of the print [excerpt from document] clearly shows my metadata. But the actual heading with Metadata: remains empty. Content does contain all the text as expected.
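For anyone else puzzled by this: `MetadataMode` just selects which metadata entries get included when a node's content is rendered. A minimal standalone sketch of the idea (illustrative names, not llama_index's actual internals):

```python
from enum import Enum

# Illustrative sketch of what MetadataMode does, NOT llama_index's
# actual implementation: the mode decides which metadata keys are
# prepended to the node text when get_content() renders it.
class Mode(Enum):
    ALL = "all"
    LLM = "llm"
    NONE = "none"

def render_content(text, metadata, excluded_llm_keys, mode):
    if mode is Mode.NONE:
        return text  # raw text only, no metadata
    keys = list(metadata)
    if mode is Mode.LLM:
        # keys marked as excluded for the LLM are dropped
        keys = [k for k in keys if k not in excluded_llm_keys]
    metadata_str = "\n".join(f"{k}: {metadata[k]}" for k in keys)
    return f"{metadata_str}\n\n{text}"

print(render_content("<content>", {"title": "T", "file_path": "/tmp/x"},
                     excluded_llm_keys={"file_path"}, mode=Mode.LLM))
# title: T
#
# <content>
```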
Is this coming from the metadata extractors?
Or did you add this metadata yourself?
It's a combination of both
Plain Text
def create_document(text, metadata):
    document = Document(
        text=" ".join(text),
        metadata=metadata,
        metadata_separator="::",
        metadata_template="{key}=>{value}",
        text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
    )
    return document
Here I create my own documents, providing a dictionary of metadata
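For reference, those custom templates combine roughly like this; a standalone sketch of the string formatting (my own re-implementation, not llama_index code):

```python
def render_document(text, metadata,
                    metadata_separator="::",
                    metadata_template="{key}=>{value}",
                    text_template="Metadata: {metadata_str}\n-----\nContent: {content}"):
    # Format each key/value pair with metadata_template, join the pairs
    # with metadata_separator, then slot the result into text_template.
    metadata_str = metadata_separator.join(
        metadata_template.format(key=k, value=v) for k, v in metadata.items()
    )
    return text_template.format(metadata_str=metadata_str, content=text)

print(render_document("Some article text.", {"Chapter": "II", "Article": "12"}))
# Metadata: Chapter=>II::Article=>12
# -----
# Content: Some article text.
```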
Plain Text
metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=3, node_template=DEFAULT_TITLE_NODE_TEMPLATE, combine_template=DEFAULT_TITLE_COMBINE_TEMPLATE),
        KeywordExtractor(keywords=2),
        SummaryExtractor(summaries=["prev", "self"], prompt_template=DEFAULT_SUMMARY_EXTRACT_TEMPLATE),
     ],
)

node_parser = SimpleNodeParser(
    metadata_extractor=metadata_extractor
)

tax_nodes = node_parser.get_nodes_from_documents(documents)
I then use this piece of code
Which I guess turns the docs into nodes; it seems like the metadata is still present after this
hmm I wonder if the extractors are messing around with the metadata formatting? Taking a peek at the code
This is one of the nodes from the code above
[Attachment: image.png]
Old metadata seems intact
Which explains the excerpt thing
Ah! That explains that, I wasn't able to find those in the Document class https://github.com/jerryjliu/llama_index/blob/main/llama_index/schema.py
Seems easy to disable at least
MetadataExtractor(extractors=[...], disable_template_rewrite=True)
Let me try that real quick
Still the same result
Plain Text
from llama_index.schema import MetadataMode
test_node = tax_nodes[2]
print("The LLM sees this: \n", test_node.get_content(metadata_mode=MetadataMode.LLM))
# print(document)


Plain Text
The LLM sees this: 
 Metadata: Hoofdstuk: Hoofdstuk II. Heffing ter zake van leveringen en diensten
Artikel: Artikel 2
Afdeling: Afdeling 1. Belastbaar feit
Paragraaf: 
document_title: Harmonisatie van de Europese Unie Wetgevingen inzake Belasting op Leveringen van Goederen en Diensten, Intracommunautaire Verwervingen en Invoer van Goederen: Wet op de Omzetbelasting 1968
excerpt_keywords:  BTW-richtlijn 2006, accijnsgoederen
prev_section_summary: 
<summary>
section_summary: 
<summary>
-----
Content: Metadata: 
-----
Content: <content>
What is going on 😡 One sec, going to run this myself lol
nah I got a test file already
hmmm I should have picked a smaller example LOL
restarting with a single small document
Takes approx 20 minutes and $8 in OpenAI costs to parse my full text lol
You love to see it
I know we want to make some of this parallel, but then rate limits become an issue..
I blame the summaries though, think that's the main driving force behind the cost and usage
I haven't had as many rate limits as I did a few months ago with llamaindex though!
I feel like in march / april rate limits were a much bigger issue, even when just building plain indices with nothing special
hmmm, ok, I think I've run into the same issue you have with the template getting all messed up. Trying to figure out how that happens now
it seems like some of the metadata is ending up in the actual text 🤦‍♂️
(Pdb) nodes[0].text
'Metadata: \n-----\nContent: \nContext\nLLMs are....'
that's not right
and all the template/separator stuff does not get inherited properly..
time to pdb.set_trace() and step through the code lol
Hopefully it doesn't take too long
I'm gonna get some dinner rn, I'll be back later tonight
Best of luck, and thanks for the effort so far!
Sounds good!
Once again very much appreciated
Will let you know what I find lol, And no worries!
Fixed the issue! There were a few compounding issues:

  • weird behaviour for getting the node content when metadata mode is NONE and the template is customized
  • metadata template was missed in the inheritance
  • too many newlines in the extracted metadata
Plain Text
(Pdb) print(nodes[0].get_content(metadata_mode=MetadataMode.LLM))
Metadata: test=>val
entities=>{'LangChain', 'SQL', 'Flask', 'Docker'}
excerpt_keywords=>LLMs, LlamaIndex
section_summary=>LlamaIndex is a data framework that provides tools to help augment LLMs with private data. It offers data connectors to ingest data sources and formats, provides ways to structure data, an advanced retrieval/query interface, and easy integrations with outer application frameworks. It is suitable for both beginner and advanced users, with a high-level API for the former and lower-level APIs for the latter.
-----
Content: 
LLMs are a phenomenonal piece of technology for knowledge generation and reasoning. 
They are pre-trained on large amounts of publicly available data.
How do we best augment LLMs with our own private data?
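The inheritance part of that fix boils down to this: when the parser splits a document into nodes, the template and separator settings have to be copied over explicitly, or the nodes silently fall back to the defaults. An illustrative sketch (hypothetical `TextUnit`/`split_into_nodes` names, not the actual patch):

```python
from dataclasses import dataclass, field

@dataclass
class TextUnit:
    text: str
    metadata: dict = field(default_factory=dict)
    metadata_separator: str = "\n"
    metadata_template: str = "{key}: {value}"
    text_template: str = "{metadata_str}\n\n{content}"

def split_into_nodes(doc: TextUnit, chunk_size: int):
    nodes = []
    for i in range(0, len(doc.text), chunk_size):
        nodes.append(TextUnit(
            text=doc.text[i:i + chunk_size],
            metadata=dict(doc.metadata),
            # The bug was essentially that these fields were not carried
            # over, so child nodes rendered with the default templates.
            metadata_separator=doc.metadata_separator,
            metadata_template=doc.metadata_template,
            text_template=doc.text_template,
        ))
    return nodes

doc = TextUnit("abcdef", {"title": "t"}, "::", "{key}=>{value}",
               "Metadata: {metadata_str}\n-----\nContent: {content}")
nodes = split_into_nodes(doc, 3)
print(nodes[0].metadata_template)
# {key}=>{value}
```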
You're amazing! Thanks very much for the effort
Just updated my code w/ the new release and everything works as expected! It also seems that the cost of parsing nodes has drastically gone down?? Instead of several dollars to parse all Documents to Nodes and build an index I now spent $0.10?

Not sure if related to this bugfix but a nice addition nonetheless
Were you setting an LLM? If not, the default LLM actually changed to gpt-3.5 from text-davinci-003, which is muuuuch cheaper
But we actually just reverted the change in favour of doing a bigger 0.8.0 release lol
Let me know if you encounter any weirdness in the prompts though! We also updated the internal prompts to hopefully work better with gpt-3.5
Nope, I used the default LLM so that explains!
So far the prompts are good, I'm rewriting all the prompts to Dutch rn in favour of receiving a native response so I don't think I'd notice regardless
However I did encounter some weirdness with the LLM where I asked about a specific Article which was clearly provided in the context but it wasn't able to synthesize a response with it. This was before I used native prompts
You encountered that weirdness when using custom prompts you mean? Or the weirdness happened with the new default prompts?
I think it was with the default prompts...? If I encounter the behaviour again I'll shoot a message. So far, so good!