Unstructured-to-structured was definitely where my mind went at first, especially since the "facts" I'm interested in are fixed points: they're events that have already occurred.
I've tried a few experiments with varying degrees of success, but generally I haven't been getting the performance I was looking for:
1) Prompting a llama model to extract relevant fields from an article.
- Direct prompts had varying degrees of success, although performance generally improved moving from 7B -> 70B. The fundamental problem was ensuring that every reference in an article got matched with its relevant price. E.g. an article mentions more than one building and more than one price, but the prices of those buildings only appear in later sentences, so incorrect prices get tagged to buildings. Maybe a model issue, maybe a prompt issue.
2) Formatting to a structured output. Prompting with "give it to me in JSON/CSV/etc." generally didn't work for me.
- Maybe the base llama models just don't do this well, or maybe my prompts were asking for more than the model could actually achieve, given I was trying to extract several pieces of information at once (rough sketch of the kind of prompt I mean just after this list).
- Further experiments with kor to enforce a specific schema did give slightly better results, although then the problem becomes providing enough few-shot examples for it to generalise while still leaving enough context to fit a new article and its generated output (sketch of the schema setup below the list as well).
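For reference, the plain "give me JSON" attempts looked roughly like this. It's a minimal sketch: `generate()` is a hypothetical stand-in for whichever Llama interface you're running, and the field names are made up for the building/price case:

```python
import json

def generate(prompt: str) -> str:
    # Hypothetical stand-in for whichever Llama interface you run
    # (llama.cpp, HF transformers, an API endpoint, etc.).
    raise NotImplementedError

PROMPT_TEMPLATE = """Extract every building sale mentioned in the article below.
Return ONLY a JSON array where each element has the keys:
"building" (string), "price" (number or null), "date" (string or null).
Prices mentioned in later sentences must be attached to the correct building.

Article:
{article}

JSON:"""

def extract_sales(article: str) -> list[dict]:
    raw = generate(PROMPT_TEMPLATE.format(article=article))
    # Base models tend to wrap the JSON in extra prose, so pull out the
    # outermost brackets and fail soft if the result still isn't valid JSON.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        return []
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
```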
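And the kor version for comparison. I'm going from memory here, so treat the imports and the exact call/return shape as approximate (they've shifted between kor/langchain versions), and everything building/price-related is just an illustrative example:

```python
from langchain_community.llms import LlamaCpp  # or whichever LangChain LLM wrapper you use
from kor import create_extraction_chain, Object, Text, Number

llm = LlamaCpp(model_path="path/to/llama-70b.gguf", temperature=0.0)  # placeholder path

# One object per (building, price) pair; many=True lets an article yield several.
schema = Object(
    id="building_sale",
    description="A building mentioned in the article together with its sale price.",
    attributes=[
        Text(id="building", description="Name or address of the building"),
        Number(id="price", description="Sale price attributed to that building"),
    ],
    examples=[
        (
            "Example Tower changed hands last month. The buyer paid $45m for it.",
            [{"building": "Example Tower", "price": 45_000_000}],
        ),
    ],
    many=True,
)

chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="json")
article_text = "..."  # the long-form article to extract from
result = chain.run(article_text)["data"]  # newer versions use .invoke(); output nesting varies
```

All of those few-shot examples get rendered into the prompt, which is exactly where the context budget goes once the article itself is long.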
There's definitely validity in using metadata to provide additional context to the document. The issue there is that there would still need to be some sort of pre-processing step to generate the right metadata tags for a document, which brings me back to an unstructured-to-structured problem fundamentally.
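Even a crude version of that pre-processing step is its own little extraction problem, e.g. something like this (tag names and keyword lists entirely made up):

```python
# Illustrative only: assign metadata tags to a document before indexing.
TAG_KEYWORDS = {
    "sale": ["sold", "acquired", "purchased", "changed hands"],
    "lease": ["leased", "let to", "pre-let"],
    "development": ["planning permission", "broke ground", "completion due"],
}

def tag_document(text: str) -> list[str]:
    lowered = text.lower()
    tags = [tag for tag, words in TAG_KEYWORDS.items()
            if any(word in lowered for word in words)]
    return tags or ["untagged"]
```

And anything smarter than keyword matching (an LLM classifier, NER, etc.) lands back at the same unstructured-to-structured question.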
Just wondering how others have tackled generating structured data from long-form articles?