Updated 2 years ago

llamaindex for big numerical csv data

At a glance

The community member is having issues ingesting a large 10MB CSV file into llamaindex, as the responses seem to include made-up numbers and incorrect counts. Another community member suggests trying to load the data into a SQL database or Pandas dataframe first, rather than directly into llamaindex, as chunking the data and putting it into a vector database may not be the best approach.

The community members discuss how to use llamaindex with SQL or Pandas, and a link is provided to the llamaindex documentation on SQL integration. The high-level idea is to first load the CSV data into a SQL database or Pandas dataframe, and then use llamaindex's text-to-SQL or Pandas functionality to interact with the data.

The community members seem optimistic that this approach could work well and that llamaindex could be the future for their use case.

What’s the best way to ingest a big CSV file (10MB) into llamaindex? I tried it, but it hallucinates a bit. The responses make up numbers that don’t exist and get counts wrong (e.g. the number of transactions with amount 190.00).
When I try to predict a trend by city, the answer is somewhat correct. It ranks Jakarta first for transactions, Surabaya second, and Bandung third.

But I’m worried that it might give misleading results once I actually use it for my company.
@Senna have you tried putting it in a SQL database or a pandas index? chunking it up and putting it in a vector db typically isn't a good idea
Can you tell me more about it? I know how a dataframe or pandas index works, but how is that going to work with llamaindex (since I have to pay attention to token limits)?
yeah check out our SQL guide and pandas index demo:

https://gpt-index.readthedocs.io/en/latest/guides/tutorials/sql_guide.html

https://gpt-index.readthedocs.io/en/latest/examples/index_structs/struct_indices/PandasIndexDemo.html

high-level idea is that you first load CSV into a SQL database or dataframe (not using an LLM, just via code), and then you can use our text-to-sql or pandas functionality
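To make the "load it with code first" step concrete, here is a minimal sketch that uses only the Python standard library. The column names (`transaction_id`, `city`, `amount`) and the sample rows are made up for illustration; the real CSV will have its own schema. The two queries at the end are the kind of SQL a text-to-SQL layer would need to produce for the questions in this thread. Because the database does the counting, the numbers come from the data rather than from the model.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the real 10MB file; columns and rows are assumptions.
CSV_TEXT = """transaction_id,city,amount
1,Jakarta,190.00
2,Jakarta,250.50
3,Surabaya,190.00
4,Bandung,75.25
5,Surabaya,120.00
6,Jakarta,190.00
"""

# Load the CSV into an in-memory SQLite table with plain code (no LLM involved).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transaction_id INTEGER, city TEXT, amount REAL)"
)
rows = csv.DictReader(io.StringIO(CSV_TEXT))
conn.executemany(
    "INSERT INTO transactions VALUES (:transaction_id, :city, :amount)", rows
)

# "How many transactions have amount 190.00?" answered by the database.
count_190 = conn.execute(
    "SELECT COUNT(*) FROM transactions WHERE amount = ?", (190.00,)
).fetchone()[0]

# "Which cities have the largest transaction totals?"
totals_by_city = conn.execute(
    "SELECT city, SUM(amount) AS total FROM transactions "
    "GROUP BY city ORDER BY total DESC"
).fetchall()

print(count_190)       # 3
print(totals_by_city)  # Jakarta first, then Surabaya, then Bandung
```

For a real file, replace `io.StringIO(CSV_TEXT)` with an opened file handle and adjust the `CREATE TABLE` statement to match the actual columns.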
so llamaindex is going to figure out the query for me, depending on the prompt, right?
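Roughly, yes. The sketch below shows the general shape of a text-to-SQL flow and why it sidesteps the token-limit concern: the prompt carries only the table schema and the question, never the rows, so its size does not grow with the CSV. This is an illustration of the idea, not llamaindex's actual implementation. `build_prompt` and `fake_llm` are hypothetical names, and `fake_llm` returns a hard-coded query in place of a real model call.

```python
import sqlite3

# A table with many rows, standing in for the loaded CSV (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transaction_id INTEGER, city TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(i, "Jakarta", 190.0) for i in range(10_000)],
)


def build_prompt(conn, table, question):
    """Build a prompt from the schema and question only; no rows are included."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    schema = ", ".join(f"{name} {ctype}" for _, name, ctype, *_ in cols)
    return (
        f"Table {table}({schema})\n"
        f"Question: {question}\n"
        "Write one SQLite query that answers the question."
    )


def fake_llm(prompt):
    """Hypothetical stand-in for a model call; a real LLM would generate this."""
    return "SELECT COUNT(*) FROM transactions WHERE amount = 190.0"


prompt = build_prompt(conn, "transactions", "How many transactions have amount 190.00?")
sql = fake_llm(prompt)

# The database, not the model, computes the number.
answer = conn.execute(sql).fetchone()[0]

print(len(prompt))  # stays small no matter how many rows the table holds
print(answer)       # 10000
```

The quality of the generated query still depends on the prompt and the model, so it is worth logging the SQL and checking it on a few known questions before relying on the answers.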
Thanks btw will read it! if this works well, this llamaindex thing is the future 🌎