

Efficiently Handling Large CSV Files with LlamaIndex

At a glance

The community member is facing an issue while trying to read a large CSV file (around 20M+ rows) using SimpleDirectoryReader, which seems to struggle with handling such a large file. They are asking if it is possible to read this file using CSVReader or if there are any other recommended approaches within LlamaIndex for efficiently handling large CSV files.

In the comments, a community member suggests customizing the Reader class and using the pandas buffer approach to load such a large file. They also advise checking if the RAM is capable of handling heavy files in memory. Another community member suggests splitting the CSV into pieces, as 20M rows is a lot, and the default CSV reader might split each row into its own document, which may not be desirable unless it's just a list of QA pairs or something textual.

I'm facing an issue while trying to read a large CSV file (around 20M+ rows) using SimpleDirectoryReader. It seems to struggle with handling such a large file.

Is it possible to read this file using CSVReader? Or are there any other recommended approaches within LlamaIndex for efficiently handling large CSV files?
2 comments
You can customise the Reader class and use a pandas buffered/chunked approach to load such a large file.

That way you can load it piece by piece instead of all at once. Btw, do check whether your RAM can actually hold a file that heavy in memory.
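Below is a minimal sketch of what such a custom reader could look like, assuming a recent `llama_index.core` install. The class name `ChunkedCSVReader`, the chunk size, and the "one Document per chunk" grouping are illustrative choices, not a built-in LlamaIndex reader.

```python
# Sketch of a custom CSV reader that streams the file with pandas'
# chunksize option so only one chunk is in memory at a time.
from pathlib import Path
from typing import List, Optional

import pandas as pd
from llama_index.core import Document
from llama_index.core.readers.base import BaseReader


class ChunkedCSVReader(BaseReader):  # hypothetical helper, not part of LlamaIndex
    def __init__(self, chunk_size: int = 50_000):
        self.chunk_size = chunk_size

    def load_data(self, file: Path, extra_info: Optional[dict] = None) -> List[Document]:
        docs: List[Document] = []
        # read_csv with chunksize returns an iterator of DataFrames
        for i, chunk in enumerate(pd.read_csv(file, chunksize=self.chunk_size)):
            text = "\n".join(
                ", ".join(str(v) for v in row)
                for row in chunk.itertuples(index=False)
            )
            docs.append(Document(text=text, metadata={"chunk": i, **(extra_info or {})}))
        return docs
```

You could then call `load_data` directly, or pass the reader to `SimpleDirectoryReader` via its `file_extractor` argument (e.g. `file_extractor={".csv": ChunkedCSVReader()}`) so `.csv` files are handled by it instead of the default CSV reader.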
You might have to split the CSV into pieces; 20M rows is going to be a lot. I'm pretty sure the default CSV reader splits each row into its own document, which you probably don't want either (unless it's just a list of QA pairs or something similarly textual). A sketch of splitting the file up front is shown below.
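Here is a small sketch of pre-splitting the CSV into smaller files so each piece can be indexed separately. The 1M-row piece size, the input filename, and the output naming are arbitrary choices for illustration.

```python
# Split a large CSV into smaller CSV files using pandas chunked reading.
import pandas as pd

rows_per_piece = 1_000_000
for i, piece in enumerate(pd.read_csv("big_file.csv", chunksize=rows_per_piece)):
    piece.to_csv(f"big_file_part_{i:03d}.csv", index=False)
```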