Efficiently Handling Large CSV Files with LlamaIndex
At a glance
The community member is facing an issue while trying to read a large CSV file (around 20M+ rows) using SimpleDirectoryReader, which seems to struggle with handling such a large file. They are asking if it is possible to read this file using CSVReader or if there are any other recommended approaches within LlamaIndex for efficiently handling large CSV files.
In the comments, a community member suggests customizing the Reader class and using pandas' chunked (buffered) reading to load such a large file. They also advise checking whether there is enough RAM to hold a file that heavy in memory. Another community member suggests splitting the CSV into pieces, since 20M rows is a lot, and notes that the default CSV reader splits each row into its own document, which is probably not desirable unless it's just a list of QA pairs or something similarly textual.
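A minimal sketch of that chunked-pandas approach, assuming you build your own loader rather than relying on SimpleDirectoryReader. The `load_large_csv` helper, the chunk size, and the row-to-text formatting are all assumptions for illustration, not part of the LlamaIndex API:

```python
# Sketch: stream a large CSV in chunks with pandas and turn each chunk
# into a single LlamaIndex Document instead of one Document per row.
from typing import List

import pandas as pd
from llama_index.core import Document


def load_large_csv(path: str, chunksize: int = 100_000) -> List[Document]:
    """Read the CSV in chunks and produce one Document per chunk."""
    documents: List[Document] = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        # Flatten the rows of this chunk into a single text block.
        text = "\n".join(
            ", ".join(f"{col}: {val}" for col, val in row.items())
            for _, row in chunk.iterrows()
        )
        documents.append(Document(text=text, metadata={"chunk": i}))
    return documents
```

Because pandas only keeps one chunk in memory at a time, this avoids loading all 20M+ rows at once; the trade-off is that you have to decide how rows should be grouped into documents yourself.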
I'm facing an issue while trying to read a large CSV file (around 20M+ rows) using SimpleDirectoryReader. It seems to struggle with handling such a large file.
Is it possible to read this file using CSVReader? Or are there any other recommended approaches within LlamaIndex for efficiently handling large CSV files?
You might have to split the CSV into pieces; 20M rows is going to be a lot. I'm pretty sure the default CSV reader splits each row into its own document, which you probably also don't want (unless it's just a list of QA pairs or something textual). A sketch of splitting the file is below.
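One way to split the file into pieces before handing them to SimpleDirectoryReader or CSVReader is to write out smaller CSVs with pandas. The output directory, file naming, and rows-per-file value here are arbitrary choices for illustration, not LlamaIndex defaults:

```python
# Sketch: split one very large CSV into smaller part files on disk.
import os

import pandas as pd


def split_csv(path: str, out_dir: str, rows_per_file: int = 500_000) -> None:
    os.makedirs(out_dir, exist_ok=True)
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_file)):
        chunk.to_csv(os.path.join(out_dir, f"part_{i:04d}.csv"), index=False)


# Example usage (hypothetical paths):
split_csv("large.csv", "csv_parts")
```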