querying tabular data

hey, is there any way to improve querying of tabular data from a vector index? more info here
did you explore text2sql?
I kind of have to make a one-size-fits-all solution for all kinds of tables, so I can't really use SQL or structured indexes. So far I've only found that using a vector index and mapping each row to a document fits my use case; see the sketch below. Are there any better ways?
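Here's roughly what I'm doing now; a minimal sketch (assuming pre-0.10 LlamaIndex import paths; the CSV file and its columns are just placeholders):

```python
# Rough sketch of my current approach: one Document per CSV row,
# all indexed in a vector store. File name and columns are placeholders.
import csv

from llama_index import Document, VectorStoreIndex

docs = []
with open("orders.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Flatten each row into "Column: value | Column: value" text.
        text = " | ".join(f"{col}: {val}" for col, val in row.items())
        docs.append(Document(text=text))

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is the ship city of order 405-8078784-5731545?")
```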
Okay. What is the problem you are facing here?
Where does it fail?
the issue is that vector search is not able to query my structured data accurately. I'm using ChromaDB. Example:

q: what is the ship city of order with order id: 405-8078784-5731545.
what it found: documents : [Order ID: 408-5748499-6859555, Order ID: 408-7955685-3083534], distances: [[0.3717946789733443, 0.38357981654556134]]

Maybe LlamaIndex could provide some kind of combined text search and vector search to improve this? πŸ€”
[Attachment: image.png]
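For now I'm experimenting with an exact-match prefilter on top of the vector search. A rough sketch (the order_id metadata key, the regex, and the sample row are my own invention):

```python
# Rough sketch: pair ChromaDB's metadata filtering with vector search.
# The "order_id" metadata key, regex, and sample row are illustrative.
import re

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("orders")

# Store each row's order id as exact-match metadata next to the text.
collection.add(
    ids=["row-1"],
    documents=["Order ID: 405-8078784-5731545 | Ship City: Mumbai"],
    metadatas=[{"order_id": "405-8078784-5731545"}],
)

query = "what is the ship city of order with order id: 405-8078784-5731545"
match = re.search(r"\d{3}-\d{7}-\d{7}", query)

# If the query contains an exact id, filter on it so vector distance
# can't surface the wrong row; otherwise fall back to pure vector search.
where = {"order_id": match.group()} if match else None
results = collection.query(query_texts=[query], n_results=2, where=where)
print(results["documents"])
```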
Okay, if you are querying this kind of table, text2sql is the best option, but if you are looking for text2sql + RAG you can check the SQLJoinQueryEngine or SQLAutoVectorQueryEngine.
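A rough text2sql sketch (assuming the pre-0.10 llama_index import paths; the database URL and table name are placeholders):

```python
# Rough text2sql sketch with LlamaIndex's NLSQLTableQueryEngine.
# Database URL and table name are placeholders.
from sqlalchemy import create_engine

from llama_index import SQLDatabase
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine

engine = create_engine("sqlite:///orders.db")
sql_database = SQLDatabase(engine, include_tables=["orders"])

# The LLM writes a SQL query against the table schema, runs it,
# and synthesizes a natural-language answer from the result.
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["orders"])
response = query_engine.query(
    "What is the ship city of the order with order id 405-8078784-5731545?"
)
print(response)
```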
@ravitheja I am facing a similar issue. I am able to process these simple tables using Form Recognizer, but I am facing issues processing tables that contain hierarchical columns or rows. Any fix for this? I have spent a week and nothing is working on all kinds of tables.
I am talking about tables from PDFs, not Excel.
You mean a column having sub-columns?
@ravitheja anything I can use?
Not sure. Haven't found anything that is useful here. @Logan M do you have any recommendations?
for reference, I want to extract the table perfectly from this page without losing any hierarchical column information
yea, I think we would all love to do that πŸ˜‰

Table extraction is a hard problem, and it's not even close to being solved.

Try using unstructured or camelot to parse the tables πŸ€”
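e.g. a minimal camelot sketch (the file path and page range are placeholders; "lattice" mode assumes the table has ruled lines):

```python
# Minimal camelot sketch: extract tables from a PDF page as DataFrames.
# File path and page range are placeholders; "lattice" assumes ruled lines.
import camelot

tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")
for table in tables:
    print(table.parsing_report)  # per-table accuracy/whitespace stats
    print(table.df)              # the extracted table as a pandas DataFrame
```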
@Logan M I have tried camelot, Form Recognizer, and the Adobe dev API, but nothing results in perfect extraction. I am losing information one way or another.
Yea I think you will find that no tool is perfect yet, and probably won't be for many years πŸ˜…
I did my entire master's thesis on document information extraction
it's a super hard problem
yeah, I just have to explain that to my client, as they think everything is possible with LLMs LMAO
hahaha oh boy

Newer models like Nougat or Kosmos-2 are getting closer.

Kosmos-2.5 should be coming out soon, and the results look quite impressive, at least from the paper.

But the hard part is getting a model that will generalize to everything. You could always annotate data and train a model for a specific domain or set of data, but that requires work (and assumes you know ahead of time the formats the model needs to handle well).
actually, in my case the documents are very generic, so it's even tougher. We are working with companies' CSR/annual reports for ESG, and each has its own unique template; we cannot generalize since there are too many companies.
yup, that's super hard haha

Is the issue you need ALL table data, or is there a specific field of data you need from each table?
because the latter is much more doable
no, I need everything, without losing any information at all
anyway, thank you so much for confirming that there is nothing for it right now. I had been looking for the last 2 weeks straight, but in vain.
Hopefully Kosmos-2.5 comes out soon. It really does look promising: https://arxiv.org/pdf/2309.11419.pdf
thanks will look into this