
Updated 2 years ago

Are there any pre-trained models that work well in converting CSV rows into meaningful text?

At a glance

The community members are discussing options for querying and extracting meaningful information from CSV data. They have considered using a LlamaHub Loader to chunk the CSV into documents, but found that this injects sequential structure which is not very useful. They have also looked at Text-to-SQL and Text-to-Pandas approaches, but have encountered issues with domain-specific columns and latency.

The main challenge is finding a pre-trained model that can effectively convert CSV rows into meaningful text, which could then be used to generate embeddings and store in a vector database for Q&A. The community members have discussed the possibility of training and hosting their own language model, but this is still a work in progress.

One community member suggests adding extra text descriptions to the SQL schema to help clarify the confusing columns, while another notes that PDF data is generally easier to work with than numerical CSV data. The community is still exploring solutions that do not rely on large language models, at least for responding to user queries.

There is also a separate discussion about issues with the GPTSQLStructStoreIndex object in the Llama Index library, where the schema is not being properly recognized during query time.

Are there any pre-trained models that work well in converting CSV rows into meaningful text?

So that I can get embeddings from the converted text and store them in some vector DB, which will help me do Q&A on the CSV?
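One common way to frame the question above is "row verbalization": render each CSV row as a short natural-language sentence, then embed that sentence. Below is a minimal stdlib-only sketch of the verbalization step; the CSV content, column names, and sentence template are illustrative, and the embedding/vector-DB step is out of scope here.

```python
import csv
import io

# Hypothetical CSV data; real column names will differ.
RAW = """order_id,region,amount
1001,EMEA,250.00
1002,APAC,75.50
"""

def row_to_text(row: dict) -> str:
    """Verbalize one CSV row as a 'column is value' sentence."""
    return "; ".join(f"{col} is {val}" for col, val in row.items()) + "."

rows = list(csv.DictReader(io.StringIO(RAW)))
texts = [row_to_text(r) for r in rows]
# Each text could then be embedded and stored in a vector DB.
print(texts[0])  # order_id is 1001; region is EMEA; amount is 250.00.
```

As the replies below note, sentences produced this way still carry little semantic signal for mostly numerical columns, which is why the discussion keeps returning to Text-to-SQL.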
9 comments
I am aware of three options for querying over structured data like a CSV:
  • Use a LlamaHub Loader. This will chunk something like a CSV into documents. This injects sequential structure into the data, which in my experience does not work very well.
  • Text-to-SQL - the query gets converted to SQL and executed, maybe with an agent to handle errors and re-execute.
  • Text-to-Pandas - same as above, but generates pandas code and executes it over your data.
To your original question, I have not seen an embedding model that is able to convert CSV rows into meaningful text (or at least one that yields better downstream results than Text-to-SQL). Embedding models (in NLP) are trained on sequential data (text), thus they inject sequential structure into the data, which I find is not that useful for most downstream applications. Long story short, take a look at the Text-to-SQL or Text-to-Pandas paradigms as a start.
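The Text-to-SQL pattern described above can be sketched with the standard library alone. The LLM call is stubbed out here with a hardcoded response (in a real system the schema and question would be sent as a prompt to a model); the table and question are made up for the demo, and SQLite stands in for whatever database holds the CSV.

```python
import sqlite3

# Toy table standing in for the CSV, loaded into SQLite.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1001, "EMEA", 250.0), (1002, "APAC", 75.5)])

SCHEMA = "orders(order_id INTEGER, region TEXT, amount REAL)"

def question_to_sql(question: str, schema: str) -> str:
    """Stand-in for the LLM call: the prompt would contain the schema plus
    the question, and the model would answer with a SQL string."""
    # Hardcoded response for this demo question.
    return "SELECT SUM(amount) FROM orders WHERE region = 'EMEA'"

sql = question_to_sql("What is the total order amount in EMEA?", SCHEMA)
result = con.execute(sql).fetchone()[0]
print(result)  # 250.0
```

Note that only the schema (and the question) reaches the model; the rows themselves are touched only by the local SQL execution, which is relevant to the privacy discussion that follows.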
Our CSV has somewhat sensitive data, so we can't send the data to any LLM.

So considering this case, I'm left with only the first option 🤔
Why does the CSV data need to get sent to the LLM? The queries are created without actually looking at the data, just the schema of the data and an explanation of it. That probably breaks privacy as well, though. Have you thought about training and hosting your own LLM? That could fix your problems.
Ok, got it now. But there are still a few problems with this. Some of the columns are very domain-specific, so I think Text-to-SQL won't be effective here.

Also, one more problem is latency (we found both of these in a few tests we have done).

We are still doing some POCs on choosing open-source LLMs, but this is very far off for now 😀
You can add extra text descriptions of the tables to the SQL index, in addition to the schema, to help clarify the confusing columns.
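The suggestion above amounts to enriching the schema text the model sees. A minimal library-free sketch of that idea: keep a dictionary of plain-English notes for the domain-specific columns and fold them into the prompt alongside the schema. The column names and descriptions here are hypothetical; LlamaIndex also supports attaching such per-table context strings directly to its SQL index objects, though the exact API has varied across versions.

```python
# Hypothetical domain-specific columns and their plain-English descriptions.
COLUMN_DOCS = {
    "mrr_delta": "month-over-month change in recurring revenue, in USD",
    "churn_flag": "1 if the customer cancelled during the period, else 0",
}

SCHEMA = "accounts(account_id INTEGER, mrr_delta REAL, churn_flag INTEGER)"

def build_schema_prompt(schema: str, docs: dict) -> str:
    """Attach column descriptions to the schema text sent to the model."""
    lines = [f"Table schema: {schema}", "Column notes:"]
    lines += [f"- {col}: {desc}" for col, desc in docs.items()]
    return "\n".join(lines)

prompt = build_schema_prompt(SCHEMA, COLUMN_DOCS)
print(prompt)
```

The model then sees what `mrr_delta` means instead of having to guess from the column name alone.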
Yeah, that might help.

We already support querying unstructured data like PDFs, so we are looking for some generalized solution by converting the CSV into unstructured data.

But I'm not sure how good this approach is 🥲
PDF is much easier. But CSV data is normally numerical columns that all mean something, and there's no way for an LLM to read all of that.

Text-to-SQL is likely the ideal solution, at least in my mind, since SQL is already pretty expressive.
Hmm, I'm trying to find some solution without an LLM, at least while responding to user queries.
Hi, I'm trying to create a GPTSQLStructStoreIndex object for an MS-SQL database that has two schemas: "sales" and "production". Whenever I create the GPTSQLStructStoreIndex, I specify the sql_database object as SQLDatabase(engine, schema="sales"). However, at query time, llama_index always reverts to the "dbo" schema and reports my table names as not found. Any clues as to why this is happening? Thanks very much.
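An MS-SQL setup can't be reproduced with the standard library, so this isn't the fix for the llama_index issue itself; it only illustrates the underlying failure mode. If generated SQL (or table reflection) drops the schema prefix, name lookup falls back to the engine's default schema (dbo in MS-SQL, main in SQLite) and either misses the table or silently hits the wrong one. The sketch uses SQLite's ATTACH to play the role of a second schema; table names are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")          # default schema is "main"
con.execute("ATTACH ':memory:' AS sales")  # stands in for the "sales" schema
con.execute("CREATE TABLE sales.orders (order_id INTEGER, amount REAL)")
con.execute("INSERT INTO sales.orders VALUES (1001, 250.0)")
# An unrelated, empty table of the same name in the default schema:
con.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

# The qualified name hits the right table; the bare name resolves
# against the default schema first and finds the empty one.
qualified = con.execute("SELECT COUNT(*) FROM sales.orders").fetchone()[0]
bare = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(qualified, bare)
```

So the first thing worth checking is whether the tables llama_index reflects (and the SQL it emits) actually carry the "sales." prefix, rather than bare names that MS-SQL resolves against dbo.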