querying tabular data

hey, is there any way to improve querying of tabular data from a vector index? more info here
did you explore text2sql?
I kind of have to make a one-size-fits-all solution for all kinds of tables, so I can't really use SQL or structured indexes. So far I've only found that using a vector index and mapping each row to a document fits my use case; see the sketch below. Are there any better ways?
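Here's roughly what I'm doing now; a minimal sketch (assuming pre-0.10 LlamaIndex import paths; the CSV file and its columns are just placeholders):

```python
# Rough sketch of my current approach: one Document per CSV row,
# all indexed in a vector store. File name and columns are placeholders.
import csv

from llama_index import Document, VectorStoreIndex

docs = []
with open("orders.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Flatten each row into "Column: value | Column: value" text.
        text = " | ".join(f"{col}: {val}" for col, val in row.items())
        docs.append(Document(text=text))

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is the ship city of order 405-8078784-5731545?")
```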
Okay. What is the problem you are facing here?
Where does it fail?
the issue is that vector search is not able to query my structured data accurately. I'm using ChromaDB. Example:

q: what is the ship city of order with order id: 405-8078784-5731545.
what it found: documents : [Order ID: 408-5748499-6859555, Order ID: 408-7955685-3083534], distances: [[0.3717946789733443, 0.38357981654556134]]

Maybe LlamaIndex could provide some kind of combined text search and vector search to improve this? πŸ€”
[Attachment: image.png]
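For now I'm experimenting with an exact-match prefilter on top of the vector search. A rough sketch (the order_id metadata key, the regex, and the sample row are my own invention):

```python
# Rough sketch: pair ChromaDB's metadata filtering with vector search.
# The "order_id" metadata key, regex, and sample row are illustrative.
import re

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("orders")

# Store each row's order id as exact-match metadata next to the text.
collection.add(
    ids=["row-1"],
    documents=["Order ID: 405-8078784-5731545 | Ship City: Mumbai"],
    metadatas=[{"order_id": "405-8078784-5731545"}],
)

query = "what is the ship city of order with order id: 405-8078784-5731545"
match = re.search(r"\d{3}-\d{7}-\d{7}", query)

# If the query contains an exact id, filter on it so vector distance
# can't surface the wrong row; otherwise fall back to pure vector search.
where = {"order_id": match.group()} if match else None
results = collection.query(query_texts=[query], n_results=2, where=where)
print(results["documents"])
```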
Okay, if you are querying this kind of table, text2sql is the best option, but if you are looking for text2sql + RAG you can check the SQLJoinQueryEngine or SQLAutoVectorQueryEngine.
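A rough text2sql sketch (assuming the pre-0.10 llama_index import paths; the database URL and table name are placeholders):

```python
# Rough text2sql sketch with LlamaIndex's NLSQLTableQueryEngine.
# Database URL and table name are placeholders.
from sqlalchemy import create_engine

from llama_index import SQLDatabase
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine

engine = create_engine("sqlite:///orders.db")
sql_database = SQLDatabase(engine, include_tables=["orders"])

# The LLM writes a SQL query against the table schema, runs it,
# and synthesizes a natural-language answer from the result.
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["orders"])
response = query_engine.query(
    "What is the ship city of the order with order id 405-8078784-5731545?"
)
print(response)
```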
@ravitheja I am facing a similar issue. I am able to process these simple tables using Form Recognizer, but I am facing issues processing tables that contain hierarchical columns or rows. Any fix for this? I have spent a week and nothing is working on all kinds of tables.
I am talking about tables from PDFs, not Excel.
You mean a column having sub-columns?
@ravitheja anything I can use?
Not sure. Haven't found anything that is useful here. @Logan M do you have any recommendations?
for reference, I want to extract the table perfectly from this page without losing any hierarchical column information
yea, I think we would all love to do that πŸ˜‰

Table extraction is a hard problem, and it's not even close to being solved.

Try using unstructured or camelot to parse the tables πŸ€”
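e.g. a minimal camelot sketch (the file path and page range are placeholders; "lattice" mode assumes the table has ruled lines):

```python
# Minimal camelot sketch: extract tables from a PDF page as DataFrames.
# File path and page range are placeholders; "lattice" assumes ruled lines.
import camelot

tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")
for table in tables:
    print(table.parsing_report)  # per-table accuracy/whitespace stats
    print(table.df)              # the extracted table as a pandas DataFrame
```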
@Logan M I have tried camelot, Form Recognizer, and the Adobe dev API, but nothing results in perfect extraction. I am losing information one way or another.
Yea I think you will find that no tool is perfect yet, and probably won't be for many years πŸ˜…
I did my entire master's thesis on document information extraction
it's a super hard problem
yeah, I just have to explain that to my client, as they think everything is possible with LLMs LMAO
hahaha oh boy

Newer models like Nougat or Kosmos-2 are getting closer.

Kosmos-2.5 should be coming out soon, and the results look quite impressive, at least from the paper.

But the hard part is getting a model that will generalize to everything. You could always annotate data and train a model for a specific domain or set of data, but that requires work (and assumes you know ahead of time the formats the model needs to handle well).
actually, in my case the documents are very generic, so it's even tougher. We are working with companies' CSR/annual reports for ESG, and each has its own unique template; we cannot generalize since there are too many companies.
yup, that's super hard haha

Is the issue you need ALL table data, or is there a specific field of data you need from each table?
because the latter is much more doable
no, I need everything, without losing any information at all
anyway, thank you so much for confirming that there is nothing for it right now. I had been looking for the last 2 weeks straight, but in vain.
Hopefully Kosmos-2.5 comes out soon. It really does look promising: https://arxiv.org/pdf/2309.11419.pdf
thanks will look into this