Is this coming from the metadata extractors?
Or did you add this metadata yourself?
It's a combination of both
from llama_index import Document

def create_document(text, metadata):
    document = Document(
        text=" ".join(text),
        metadata=metadata,
        metadata_separator="::",
        metadata_template="{key}=>{value}",
        text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
    )
    return document
Here I create my own documents, providing a dictionary of metadata
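For illustration, a call might look like this (hypothetical values; the keys mirror the metadata that shows up in the output further down):

# hypothetical example call; text fragments get joined with spaces
doc = create_document(
    text=["Artikel 2", "..."],
    metadata={
        "Hoofdstuk": "Hoofdstuk II. Heffing ter zake van leveringen en diensten",
        "Artikel": "Artikel 2",
        "Afdeling": "Afdeling 1. Belastbaar feit",
    },
)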
# imports as of the llama_index 0.7.x releases (paths may differ in other versions)
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor, TitleExtractor, KeywordExtractor, SummaryExtractor,
)
from llama_index.node_parser.extractors.metadata_extractors import (
    DEFAULT_TITLE_NODE_TEMPLATE, DEFAULT_TITLE_COMBINE_TEMPLATE, DEFAULT_SUMMARY_EXTRACT_TEMPLATE,
)

metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=3, node_template=DEFAULT_TITLE_NODE_TEMPLATE, combine_template=DEFAULT_TITLE_COMBINE_TEMPLATE),
        KeywordExtractor(keywords=2),
        SummaryExtractor(summaries=["prev", "self"], prompt_template=DEFAULT_SUMMARY_EXTRACT_TEMPLATE),
    ],
)

node_parser = SimpleNodeParser(
    metadata_extractor=metadata_extractor
)

tax_nodes = node_parser.get_nodes_from_documents(documents)
I then use this piece of code
Which I guess turns the docs into nodes; it seems like the metadata is still present after this
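To sanity-check that, something like this (illustrative) confirms the source metadata survived parsing:

# the metadata set on the Document should carry over to the parsed nodes
print(tax_nodes[0].metadata)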
hmm I wonder if the extractors are messing around with the metadata formatting? Taking a peek at the code
This is one of the nodes from the code above
Old metadata seems intact
Which explains the excerpt thing
Seems easy to disable at least
MetadataExtractor(extractors=[...], disable_template_rewrite=True)
Let me try that real quick
from llama_index.schema import MetadataMode
test_node = tax_nodes[2]
print("The LLM sees this: \n", test_node.get_content(metadata_mode=MetadataMode.LLM))
The LLM sees this:
Metadata: Hoofdstuk: Hoofdstuk II. Heffing ter zake van leveringen en diensten
Artikel: Artikel 2
Afdeling: Afdeling 1. Belastbaar feit
Paragraaf:
document_title: Harmonisatie van de Europese Unie Wetgevingen inzake Belasting op Leveringen van Goederen en Diensten, Intracommunautaire Verwervingen en Invoer van Goederen: Wet op de Omzetbelasting 1968
excerpt_keywords: BTW-richtlijn 2006, accijnsgoederen
prev_section_summary:
<summary>
section_summary:
<summary>
-----
Content: Metadata:
-----
Content: <content>
What is going on 😵 One sec, going to run this myself lol
nah I got a test file already
hmmm I should have picked a smaller example LOL
restarting with a single small document
Takes approx. 20 minutes and $8 in OpenAI costs to parse my full text lol
I know we want to make some of this parallel, but then rate limits become an issue..
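If/when we do parallelize, the usual workaround is to cap the number of in-flight requests; a generic sketch of the pattern (not LlamaIndex code, just the idea):

import asyncio

async def throttled_gather(coros, max_concurrent=5):
    # cap concurrent requests so we stay under the provider's rate limit
    sem = asyncio.Semaphore(max_concurrent)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))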
I blame the summaries though, think that's the main driving force behind the cost and usage
I haven't hit as many rate limits as I did a few months ago with LlamaIndex though!
I feel like in March/April rate limits were a much bigger issue, even when just building plain indices with nothing special
hmmm, OK, I think I've run into the same issue you have with the template getting all messed up. Trying to figure out how that happens now
it seems like some of the metadata is ending up in the actual text 🤦‍♂️
(Pdb) nodes[0].text
'Metadata: \n-----\nContent: \nContext\nLLMs are....'
and all the template/separator stuff does not get inherited properly..
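A quick way to see the gap (illustrative; both Document and the parsed nodes carry a metadata_template attribute):

print(documents[0].metadata_template)  # '{key}=>{value}', as set on the Document
print(nodes[0].metadata_template)      # falls back to the library default instead of inheriting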
time to pdb.set_trace()
and step through the code lol
Hopefully it doesn't take too long
I'm gonna get some dinner rn, I'll be back later tonight
Best of luck, and thanks for the effort so far!
Once again very much appreciated
Will let you know what I find lol, and no worries!
Fixed the issue! There were a few compounding issues
- weird behaviour when getting the node content with metadata mode NONE and a customized template
- the metadata template was missed in the inheritance
- too many newlines in the extracted metadata
(Pdb) print(nodes[0].get_content(metadata_mode=MetadataMode.LLM))
Metadata: test=>val
entities=>{'LangChain', 'SQL', 'Flask', 'Docker'}
excerpt_keywords=>LLMs, LlamaIndex
section_summary=>LlamaIndex is a data framework that provides tools to help augment LLMs with private data. It offers data connectors to ingest data sources and formats, provides ways to structure data, an advanced retrieval/query interface, and easy integrations with outer application frameworks. It is suitable for both beginner and advanced users, with a high-level API for the former and lower-level APIs for the latter.
-----
Content:
LLMs are a phenomenal piece of technology for knowledge generation and reasoning.
They are pre-trained on large amounts of publicly available data.
How do we best augment LLMs with our own private data?
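For completeness, the NONE mode should now return just the raw text too; a quick check on the same nodes:

# with the fix, NONE should return only the raw content, no template wrapping
print(nodes[0].get_content(metadata_mode=MetadataMode.NONE))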
You're amazing! Thanks very much for the effort
Just updated my code w/ the new release and everything works as expected! It also seems that the cost of parsing nodes has drastically gone down?? Instead of several dollars to parse all Documents to Nodes and build an index, I now spent about $0.10?
Not sure if related to this bugfix but a nice addition nonetheless
Were you setting an LLM? If not, the default LLM actually changed to gpt-3.5 from text-davinci-003, which is muuuuch cheaper
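If you want to be immune to default swaps, you can pin the LLM explicitly; a rough sketch assuming the 0.7.x-era ServiceContext API:

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI

# pin the model so a library-default change can't silently swap it
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"))
index = VectorStoreIndex.from_documents(documents, service_context=service_context)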
But we actually just reverted the change in favour of doing a bigger 0.8.0 release lol
Let me know if you encounter any weirdness in the prompts though! We also updated the internal prompts to hopefully work better with gpt-3.5
Nope, I used the default LLM, so that explains it!
So far the prompts are good. I'm rewriting all the prompts in Dutch rn so the responses come back in Dutch natively, so I don't think I'd notice regardless
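Roughly what I mean (a sketch, assuming the Prompt API from this llama_index version; index is whatever index you built):

from llama_index import Prompt

# QA template in Dutch so answers come back in Dutch
# (translation: "Context information is below. ... Answer the question based on the context.")
dutch_qa_template = Prompt(
    "Contextinformatie staat hieronder.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Beantwoord de vraag op basis van de context: {query_str}\n"
)

query_engine = index.as_query_engine(text_qa_template=dutch_qa_template)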
However, I did encounter some weirdness where I asked the LLM about a specific Article that was clearly provided in the context, but it wasn't able to synthesize a response from it. This was before I used the native prompts
You encountered that weirdness when using custom prompts you mean? Or the weirdness happened with the new default prompts?
I think it was with the default prompts...? If I encounter the behaviour again I'll shoot a message. So far, so good!