Can you fine-tune a query engine? For example, I have a CSV file with building lot property development standards, e.g. "Front setback primary is 50ft if lot width is greater than 100 feet, otherwise Front setback secondary is 20ft". My query is something like: "What is the Front setback distance for a lot if the lot width is 155 feet and the lot area is 0.5 acre?" But the query engine is getting confused about whether to select "Primary" over "Secondary".
Can I use the guidance question generator to analyze both training data (prompt and completion) and context data (the complete context to derive the answer from)?
Not sure I understand. To explain further: I have vectorized a CSV file with columns "title", "heading" and "content" using OpenAI's embeddings API, then used cosine similarity to pick "rows" from this data to create a context for answering the prompt. I have been using this approach on a small test CSV file with some success, albeit inconsistent, but I want to use llama_index on the large PDF that the data in this CSV file was derived from. (The test file had 3 building zones to check to find its answer; the large PDF has 50+.)
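For context, the retrieval step described above looks roughly like this (a minimal sketch only: the file name `standards.csv`, the way the columns are joined, and the use of the pre-1.0 `openai` SDK are assumptions, not the exact code):

```python
# Sketch of the CSV row-retrieval approach: embed each row, then pick the
# most similar rows to the query with cosine similarity.
import numpy as np
import openai
import pandas as pd

df = pd.read_csv("standards.csv")  # assumed columns: title, heading, content
texts = (df["title"] + " | " + df["heading"] + " | " + df["content"]).tolist()

def embed(texts):
    # pre-1.0 openai SDK call; newer SDKs use client.embeddings.create(...)
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

row_vecs = embed(texts)

def top_k_rows(query, k=3):
    q = embed([query])[0]
    sims = row_vecs @ q / (np.linalg.norm(row_vecs, axis=1) * np.linalg.norm(q))
    return df.iloc[np.argsort(sims)[::-1][:k]]

context = "\n".join(
    top_k_rows("What is the front setback for a lot 155 ft wide and 0.5 acre?")["content"]
)
```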
Thanks @Logan M, the CSVReader is great when you have no merged cells but doesn't work otherwise. Is there a way for it to work on merged tables? Example image and CSV file attached.
@Logan M I am using pandas to do this as a workaround atm.
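One way that pandas workaround can look (a sketch under the assumption that merged cells arrive as blank/NaN values after the CSV export; the file and column names are placeholders):

```python
# Un-merge cells by forward-filling the blank values left behind by merged cells.
import pandas as pd

df = pd.read_csv("zoning_table.csv")
# Repeat the merged value down each group of rows it originally spanned.
df[["zone", "standard"]] = df[["zone", "standard"]].ffill()
df.to_csv("zoning_table_unmerged.csv", index=False)
```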
At a high level I am trying to convert a large PDF (800 pages) to an LLM-friendly Document. This Document is regulatory in nature, as you no doubt have noticed, which means it has a combination of text, hyperlinks, tables with merged cells, and tables with unmerged cells. This would be a one-off exercise to have a clean Document for prompts, but my question is: can I construct a Document and pass in different types of data readers and/or pandas data juggling, etc.? Am I asking for something like OpenAI's function call API? I don't know. Can I achieve this with Llama_Index? If so, can you please point me in the right direction?
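For what it's worth, one possible shape of "different readers plus pandas feeding the same index" is sketched below. It is an assumption, not a confirmed recipe: import paths and keyword arguments vary across llama_index versions, and the file names are placeholders.

```python
# Stitch mixed sources into llama_index Documents: narrative text from a PDF
# reader, plus tables cleaned up with pandas first.
from llama_index import Document, SimpleDirectoryReader, VectorStoreIndex
import pandas as pd

# 1. Narrative text straight from the PDF
text_docs = SimpleDirectoryReader(input_files=["regulations.pdf"]).load_data()

# 2. Tables pre-processed with pandas (e.g. after un-merging cells)
table_df = pd.read_csv("zoning_table_unmerged.csv")
table_docs = [
    Document(text=row.to_json(), metadata={"source": "zoning_table"})
    for _, row in table_df.iterrows()
]

# 3. One index over both kinds of Documents
index = VectorStoreIndex.from_documents(text_docs + table_docs)
query_engine = index.as_query_engine()
```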
Do you have an example of loading a rich-media PDF (or HTML) or similar into an LLM-friendly Document (or Documents)? Or should I just be splitting these into many separate Documents and using the SubQuestionQueryEngine and feeding it an array of Documents?
I don't have a rich media example, it's uncharted territory here lol. Gotta sail your own way tbh.
Also btw, SubQuestionQueryEngine takes a list of query engines, not a list of documents. You probably want to split your documents into a few indexes, but I would try to segregate them into around 3-6 categories if possible.
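A rough sketch of what "a list of query engines" means in practice, assuming one index per category wrapped in a QueryEngineTool. Import paths depend on your llama_index version, and the category names and the `*_docs` lists of Documents are made-up placeholders:

```python
# Build one query engine per category, then hand the tools to SubQuestionQueryEngine.
from llama_index import VectorStoreIndex
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

categories = [
    ("residential", residential_docs),  # placeholder: list of Documents
    ("commercial", commercial_docs),    # placeholder
    ("overlay", overlay_docs),          # placeholder
]

tools = [
    QueryEngineTool(
        query_engine=VectorStoreIndex.from_documents(docs).as_query_engine(),
        metadata=ToolMetadata(name=name, description=f"Development standards for {name} zones"),
    )
    for name, docs in categories
]

engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = engine.query("What is the front setback for a 155 ft wide lot?")
```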