Unstructured-to-structured was definitely where my mind went at first, especially since the "facts" I'm interested in are fixed points: they're events that have already occurred.
I've tried a few experiments with varying degrees of success, but generally I haven't been getting the performance I was looking for:
1) Prompting a llama model to extract relevant fields from an article.
- Direct prompts had varying degrees of success, although performance generally improved moving from 7B -> 70B. The fundamental problem was ensuring that every reference in an article got matched with its relevant price. E.g. an article mentions more than one building and more than one price, but the prices of those buildings only appear in later sentences, so incorrect prices get tagged to buildings. Maybe a model issue, maybe a prompt issue.
2) Formatting to a structured output. Prompting with "give it to me in JSON/CSV/etc." generally didn't work for me.
- Maybe the base llama models just don't do this well, or maybe my prompts were asking for more than the model could actually achieve, given I was trying to extract several pieces of information at once (rough sketch of the kind of prompt I mean just after this list).
- Further experiments with kor to enforce a specific schema did give slightly better results, although then the problem becomes providing enough few-shot examples for it to generalise while still leaving enough context to fit a new article and its generated output (sketch of the schema setup below the list as well).
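For reference, the plain "give me JSON" attempts looked roughly like this. It's a minimal sketch: `generate()` is a hypothetical stand-in for whichever Llama interface you're running, and the field names are made up for the building/price case:

```python
import json

def generate(prompt: str) -> str:
    # Hypothetical stand-in for whichever Llama interface you run
    # (llama.cpp, HF transformers, an API endpoint, etc.).
    raise NotImplementedError

PROMPT_TEMPLATE = """Extract every building sale mentioned in the article below.
Return ONLY a JSON array where each element has the keys:
"building" (string), "price" (number or null), "date" (string or null).
Prices mentioned in later sentences must be attached to the correct building.

Article:
{article}

JSON:"""

def extract_sales(article: str) -> list[dict]:
    raw = generate(PROMPT_TEMPLATE.format(article=article))
    # Base models tend to wrap the JSON in extra prose, so pull out the
    # outermost brackets and fail soft if the result still isn't valid JSON.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        return []
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
```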
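And the kor version for comparison. I'm going from memory here, so treat the imports and the exact call/return shape as approximate (they've shifted between kor/langchain versions), and everything building/price-related is just an illustrative example:

```python
from langchain_community.llms import LlamaCpp  # or whichever LangChain LLM wrapper you use
from kor import create_extraction_chain, Object, Text, Number

llm = LlamaCpp(model_path="path/to/llama-70b.gguf", temperature=0.0)  # placeholder path

# One object per (building, price) pair; many=True lets an article yield several.
schema = Object(
    id="building_sale",
    description="A building mentioned in the article together with its sale price.",
    attributes=[
        Text(id="building", description="Name or address of the building"),
        Number(id="price", description="Sale price attributed to that building"),
    ],
    examples=[
        (
            "Example Tower changed hands last month. The buyer paid $45m for it.",
            [{"building": "Example Tower", "price": 45_000_000}],
        ),
    ],
    many=True,
)

chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="json")
article_text = "..."  # the long-form article to extract from
result = chain.run(article_text)["data"]  # newer versions use .invoke(); output nesting varies
```

All of those few-shot examples get rendered into the prompt, which is exactly where the context budget goes once the article itself is long.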
There's definitely validity in using metadata to provide additional context to the document. The issue there is that there would still need to be some sort of pre-processing step to generate the right metadata tags for a document, which brings me back to an unstructured-to-structured problem fundamentally.
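Even a crude version of that pre-processing step is its own little extraction problem, e.g. something like this (tag names and keyword lists entirely made up):

```python
# Illustrative only: assign metadata tags to a document before indexing.
TAG_KEYWORDS = {
    "sale": ["sold", "acquired", "purchased", "changed hands"],
    "lease": ["leased", "let to", "pre-let"],
    "development": ["planning permission", "broke ground", "completion due"],
}

def tag_document(text: str) -> list[str]:
    lowered = text.lower()
    tags = [tag for tag, words in TAG_KEYWORDS.items()
            if any(word in lowered for word in words)]
    return tags or ["untagged"]
```

And anything smarter than keyword matching (an LLM classifier, NER, etc.) lands back at the same unstructured-to-structured question.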
Just wondering how others have tackled generating structured data from long-form articles?