Can you fine-tune a query engine? For example, I have a CSV file with building lot property development standards, e.g. "Front setback primary is 50ft if lot width is greater than 100 feet, otherwise Front setback secondary is 20ft". My query is something like: "What is the Front setback distance for a lot if the lot width is 155 feet and the lot area is 0.5 acre?" But the query engine is getting confused about whether to select "Primary" over "Secondary".
Can I use the guidance question generator to analyze both training data (prompt and completion) and context data (the complete context to derive the answer from)?
Not sure I understand. To explain further: I have vectorized a CSV file with columns "title", "heading" and "content" using OpenAI's embeddings API, then used cosine similarity to pick "rows" from this data to create a context for answering the prompt. I have been using this approach on a small test CSV file with some success, albeit inconsistent, but I want to use llama_index on the large PDF that the data in this CSV file was derived from. (The test file had 3 building zones to check to find its answer; the large PDF has 50+.)
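For context, the retrieval step described above looks roughly like this (a minimal sketch only: the file name `standards.csv`, the way the columns are joined, and the use of the pre-1.0 `openai` SDK are assumptions, not the exact code):

```python
# Sketch of the CSV row-retrieval approach: embed each row, then pick the
# most similar rows to the query with cosine similarity.
import numpy as np
import openai
import pandas as pd

df = pd.read_csv("standards.csv")  # assumed columns: title, heading, content
texts = (df["title"] + " | " + df["heading"] + " | " + df["content"]).tolist()

def embed(texts):
    # pre-1.0 openai SDK call; newer SDKs use client.embeddings.create(...)
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

row_vecs = embed(texts)

def top_k_rows(query, k=3):
    q = embed([query])[0]
    sims = row_vecs @ q / (np.linalg.norm(row_vecs, axis=1) * np.linalg.norm(q))
    return df.iloc[np.argsort(sims)[::-1][:k]]

context = "\n".join(
    top_k_rows("What is the front setback for a lot 155 ft wide and 0.5 acre?")["content"]
)
```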
Thanks @Logan M, the CSVReader is great when you have no merged cells but doesn't work otherwise. Is there a way for it to work on merged tables? Example image and CSV file attached.
@Logan M I am using pandas to do this as a workaround atm.
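One way that pandas workaround can look (a sketch under the assumption that merged cells arrive as blank/NaN values after the CSV export; the file and column names are placeholders):

```python
# Un-merge cells by forward-filling the blank values left behind by merged cells.
import pandas as pd

df = pd.read_csv("zoning_table.csv")
# Repeat the merged value down each group of rows it originally spanned.
df[["zone", "standard"]] = df[["zone", "standard"]].ffill()
df.to_csv("zoning_table_unmerged.csv", index=False)
```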
At a high level I am trying to convert a large PDF (800 pages) to an LLM-friendly Document. This Document is regulatory in nature, as you no doubt have noticed, which means it has a combination of text, hyperlinks, tables with merged cells, and tables with unmerged cells. This would be a one-off exercise to have a clean Document for prompts, but my question is: can I construct a Document and pass in different types of data readers and/or pandas data juggling, etc.? Am I asking for something like OpenAI's function call API? I don't know. Can I achieve this with Llama_Index? If so, can you please point me in the right direction?
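For what it's worth, one possible shape of "different readers plus pandas feeding the same index" is sketched below. It is an assumption, not a confirmed recipe: import paths and keyword arguments vary across llama_index versions, and the file names are placeholders.

```python
# Stitch mixed sources into llama_index Documents: narrative text from a PDF
# reader, plus tables cleaned up with pandas first.
from llama_index import Document, SimpleDirectoryReader, VectorStoreIndex
import pandas as pd

# 1. Narrative text straight from the PDF
text_docs = SimpleDirectoryReader(input_files=["regulations.pdf"]).load_data()

# 2. Tables pre-processed with pandas (e.g. after un-merging cells)
table_df = pd.read_csv("zoning_table_unmerged.csv")
table_docs = [
    Document(text=row.to_json(), metadata={"source": "zoning_table"})
    for _, row in table_df.iterrows()
]

# 3. One index over both kinds of Documents
index = VectorStoreIndex.from_documents(text_docs + table_docs)
query_engine = index.as_query_engine()
```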
Do you have an example of loading a rich-media PDF (or HTML) or similar into an LLM-friendly Document (or Documents)? Or should I just be splitting these into many separate Documents and using the SubQuestionQueryEngine and feeding it an array of Documents?
I don't have a rich media example, it's uncharted territory here lol. Gotta sail your own way tbh.
Also btw, SubQuestionQueryEngine takes a list of query engines, not a list of documents. You probably want to split your documents into a few indexes, but I would try to segregate them into around 3-6 categories if possible.
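A rough sketch of what "a list of query engines" means in practice, assuming one index per category wrapped in a QueryEngineTool. Import paths depend on your llama_index version, and the category names and the `*_docs` lists of Documents are made-up placeholders:

```python
# Build one query engine per category, then hand the tools to SubQuestionQueryEngine.
from llama_index import VectorStoreIndex
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

categories = [
    ("residential", residential_docs),  # placeholder: list of Documents
    ("commercial", commercial_docs),    # placeholder
    ("overlay", overlay_docs),          # placeholder
]

tools = [
    QueryEngineTool(
        query_engine=VectorStoreIndex.from_documents(docs).as_query_engine(),
        metadata=ToolMetadata(name=name, description=f"Development standards for {name} zones"),
    )
    for name, docs in categories
]

engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = engine.query("What is the front setback for a 155 ft wide lot?")
```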