
Updated 2 days ago

Efficient Approach for Generating Embeddings from CSV Files for Retrieval-Augmented Generation

Hi everyone,
I’m working with CSV files and exploring the best way to generate and save embeddings for them. I noticed that PagedCSVReader creates one embedding per row, which can be time-consuming for large files.

Could you recommend a more efficient approach to generate embeddings while maintaining accuracy for Retrieval-Augmented Generation (RAG)? I’m looking for something that balances embedding granularity and performance, especially for structured tabular data.

Thanks in advance for your insights!
9 comments
Are you going to run queries on top of your CSV file?
Like "give me the top 5 records"?
Yes, for example: "Provide the transaction record details along with the total for November 18th."

There are 5-20 records in total for each date.
Ah, then you should check out the Pandas query engine if you are going to query your CSV records.
This is experimental: https://docs.llamaindex.ai/en/stable/examples/query_engine/pandas_query_engine/

If this doesn't meet your needs, you can check out PandasAI too. It's specifically designed for querying your own data: https://docs.pandas-ai.com/intro
@WhiteFang_Jr Thanks for your suggestion! I will definitely try this.

Could you suggest the most effective approach—should I create embeddings row-wise (using PagedCSVReader) or chunk-wise (using CSVReader)?
Row-wise gave me better responses for my use case.
Yes, but this approach is time-consuming for large files with around 1 lakh (100,000) rows.
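One possible middle ground between row-wise and fixed-size chunks, given that each date has only 5-20 records, is to build one text chunk per date and embed that instead of each row. The sketch below uses plain pandas only; the column names (`date`, `description`, `amount`), the sample values, and the helper `chunks_by_date` are assumptions for illustration, and whether retrieval quality holds up against row-wise embeddings is untested here.

```python
import pandas as pd


def chunks_by_date(frame, date_col="date"):
    """Build one text chunk per distinct date instead of one per row."""
    chunks = []
    for day, group in frame.groupby(date_col, sort=True):
        # Render every row of this date as "column: value" pairs.
        lines = [
            ", ".join(f"{col}: {val}" for col, val in row.items())
            for _, row in group.iterrows()
        ]
        chunks.append({"date": str(day), "text": "\n".join(lines)})
    return chunks


# Hypothetical sample data (not from the thread).
df = pd.DataFrame(
    {
        "date": ["2024-11-17", "2024-11-18", "2024-11-18"],
        "description": ["Coffee", "Groceries", "Fuel"],
        "amount": [4.50, 62.30, 40.00],
    }
)

chunks = chunks_by_date(df)
print(len(df), "rows ->", len(chunks), "chunks")
for chunk in chunks:
    print(chunk["date"])
    print(chunk["text"])
```

Each `text` value would then be passed to whatever embedding model you already use, so the number of embedding calls drops from the number of rows to the number of distinct dates.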
True, but if you use PandasAI or the Pandas query engine,
you don't have to create embeddings for this.

What happens is that these two tools take the head info of the CSV, form a pandas query based on your question, and apply it to the pandas DataFrame. The answer is then produced from the result of that query.
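To make that concrete, here is a rough sketch of the kind of pandas code such a tool might end up running for the "transaction details and total for November 18th" question. It is written by hand, not generated by either tool; the column names, the sample rows, the year 2024, and the helper `records_and_total` are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; real column names will differ.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-11-17", "2024-11-18", "2024-11-18", "2024-11-19"]
        ),
        "description": ["Coffee", "Groceries", "Fuel", "Books"],
        "amount": [4.50, 62.30, 40.00, 25.00],
    }
)


def records_and_total(frame, day):
    """Return the rows for one calendar day and the sum of their amounts."""
    mask = frame["date"].dt.date == pd.Timestamp(day).date()
    rows = frame.loc[mask]
    return rows, float(rows["amount"].sum())


rows, total = records_and_total(df, "2024-11-18")
print(rows)
print("Total:", total)
```

Because the filter and the sum run directly on the DataFrame, nothing needs to be embedded, and the cost does not depend on how many rows an embedding model would have to process.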

You also have the option of not exposing your own data: PandasAI creates sample data based on the head and uses that to provide answers.
Okay, I will try PandasAI for my use case.
Thanks for your help!