Efficient Approach for Generating Embeddings from CSV Files for Retrieval-Augmented Generation

Question

Hi everyone, I’m working with CSV files and exploring the best way to generate and save embeddings for them. I noticed that PagedCSVReader creates one embedding per row, which can be time-consuming for large files. Could you recommend a more efficient approach to generate embeddings while maintaining accuracy for Retrieval-Augmented Generation (RAG)? I’m looking for something that balances embedding granularity and performance, especially for structured tabular data. Thanks in advance for your insights!

WhiteFang_Jr · Answer

Are you going to run queries on top of your CSV file?

WhiteFang_Jr · Answer

Like "give me the top 5 records"?

Manan Patel · Answer

Yes, for example: "Provide the transaction record details along with the total for November 18th." There are 5-20 records in total for each date.

WhiteFang_Jr · Answer

Ah, then you should check out the Pandas query engine if you are going to query your CSV records.
Note that it is experimental: https://docs.llamaindex.ai/en/stable/examples/query_engine/pandas_query_engine/

If this doesn't meet your needs, you can check out PandasAI too; it's specifically designed for querying your own data: https://docs.pandas-ai.com/intro

Manan Patel · Answer

@WhiteFang_Jr Thanks for your suggestion! I will definitely try this.

Could you suggest the most effective approach: should I create embeddings row-wise (using PagedCSVReader) or chunk-wise (using CSVReader)?

WhiteFang_Jr · Answer

Row-wise gave me better responses for my use case.

Manan Patel · Answer

Yes, but this approach is time-consuming for large files with around 1 lakh (100,000) rows.
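
To make the granularity trade-off concrete, here is a minimal sketch using only the Python standard library. It is not how PagedCSVReader or CSVReader are implemented; the sample data, column names, `row_to_text`, and `ROWS_PER_CHUNK` are all assumptions made up for illustration. It only shows that grouping rows reduces the number of texts that would need to be embedded.

```python
import csv
import io

# Hypothetical CSV content; the columns are assumptions for illustration.
raw = """date,description,amount
2024-11-17,Coffee,120
2024-11-18,Groceries,450
2024-11-18,Fuel,300
2024-11-19,Books,75
"""

rows = list(csv.DictReader(io.StringIO(raw)))


def row_to_text(row):
    # Render one CSV row as "column: value" pairs so each value keeps its header.
    return ", ".join(f"{key}: {value}" for key, value in row.items())


# Row-wise: one text (and therefore one embedding call) per row.
row_docs = [row_to_text(row) for row in rows]

# Chunk-wise: a fixed number of rows per text, so fewer embedding calls.
ROWS_PER_CHUNK = 2  # assumed value; tune for your data
chunk_docs = [
    "\n".join(row_docs[i:i + ROWS_PER_CHUNK])
    for i in range(0, len(row_docs), ROWS_PER_CHUNK)
]

print(len(row_docs), "row-wise texts vs", len(chunk_docs), "chunk-wise texts")
```

Larger chunks mean fewer embeddings, but each one mixes several records, which may make it harder to retrieve a single transaction precisely.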

WhiteFang_Jr · Answer

True, but if you use PandasAI or the Pandas query engine, you don't have to create embeddings for this.

What happens there is that these two tools take the head of the CSV (the column names and first few rows), form a pandas query based on your question, and apply it to the pandas DataFrame. The answer is then generated from the result.

There is also an option to not expose your own data: PandasAI creates sample data based on the head to provide answers.
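
A rough sketch of that flow in plain pandas, with the LLM step replaced by a hard-coded string. This is not the actual code of either tool; the sample data, column names, and `generated_expr` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical transaction data; the columns are assumptions for illustration.
df = pd.DataFrame(
    {
        "date": ["2024-11-17", "2024-11-18", "2024-11-18", "2024-11-19"],
        "description": ["Coffee", "Groceries", "Fuel", "Books"],
        "amount": [120, 450, 300, 75],
    }
)

# Step 1: only the head (column names plus a few sample rows) would be
# placed in the prompt, not the full file.
schema_preview = df.head(2).to_string(index=False)

# Step 2: the LLM would turn the user's question into a pandas expression.
# Here the expression is hard-coded as a stand-in for the model's output.
generated_expr = "df[df['date'] == '2024-11-18']"

# Step 3: the expression is evaluated against the full DataFrame.
# Evaluating generated code is unsafe outside a sandbox; this is a demo only.
matches = eval(generated_expr, {"df": df})
total = matches["amount"].sum()

print(schema_preview)
print(matches.to_string(index=False))
print("Total:", total)
```

No embeddings are computed at any point, so the cost does not grow with the number of rows the way row-wise embedding does.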

Manan Patel · Answer

Okay, I will try PandasAI for my use case. Thanks for your help!
