
Large (400+% larger) storage size in relation to size of original dataset

At a glance

The community member is working on a project that involves processing thousands of PDF files using the llama-index PDF reader, and using the contents to augment the capabilities of a Mistral 7B chatbot. After an initial test of processing about 50 files (~300MB), the resulting JSON file took up approximately 1.2GB, a ~400% increase in storage space compared to the original dataset. The community member is curious if this dataset-to-index storage ratio is typical when using llama-index, or if there might be something they are doing wrong that is causing the data to be more bloated than necessary.

In the comments, another community member notes that a storage footprint 400+% larger than the original dataset seems concerning and suggests looking into why the processed/indexed data takes up so much more space than the original PDFs, since plaintext extraction should, if anything, result in a smaller file size.

I am working on a project where I will ultimately be processing thousands of PDF files (full length ebooks) with the llama-index PDF reader, and using the contents of these PDF files to augment the capabilities of a Mistral 7B chatbot.

I just ran an initial test where I processed about 50 files (~300MB worth of data) and when I did index.storage_context.persist(persist_dir=".") the resulting JSON took up approximately 1.2GB - a ~400% increase in storage space compared to the original dataset... is this pretty typical or is there perhaps something I'm doing wrong that is making it more bloated than it needs to be?
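A minimal sketch of the kind of pipeline described above (the folder path, reader, and default settings are assumptions, not the exact script that was used):

```python
# Minimal sketch of the pipeline described above; "./ebooks" and the default
# settings are assumptions, not the actual setup. Import paths are for
# llama-index >= 0.10 (older versions import from `llama_index` directly).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Read the PDFs and extract their text into Document objects.
documents = SimpleDirectoryReader("./ebooks", required_exts=[".pdf"]).load_data()

# Chunk the text, embed each chunk, and build an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)

# Write the docstore, index store, and vector store out as JSON.
index.storage_context.persist(persist_dir=".")
```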
(I'm happy to share more specific info on how I created the index if needed, but am just curious if this dataset:index storage ratio is roughly typical when using llama-index)

2 comments

I just can't conceptually wrap my head around what would be taking up so much space if it's just chunking the text from the PDFs and storing vector embeddings for the chunks that take up a few KB each ...
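As a rough illustration of the "few KB each" figure (the 1024-dimensional embedding size below is an assumption, not the actual model used in this index):

```python
# Rough per-chunk arithmetic for embedding storage (illustrative numbers only;
# the embedding dimension is an assumption, not the actual model used here).
embed_dim = 1024                      # embedding dimensions (assumed)

# Stored as raw float32, one embedding is embed_dim * 4 bytes:
binary_kb = embed_dim * 4 / 1024      # 4.0 KB per chunk

# Serialized to JSON, each float becomes a ~12-20 character decimal string,
# so the same vector takes several times more space on disk:
json_kb = embed_dim * 15 / 1024       # ~15 KB per chunk (rough average)

print(f"float32: {binary_kb:.1f} KB/chunk, JSON text: ~{json_kb:.0f} KB/chunk")
```

With chunks on the order of a kilobyte of text each, the serialized vectors can easily outweigh the extracted text itself, which would be consistent with the ~4x blow-up described above.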

if anything, because it's only extracting plaintext from the PDFs, I would actually expect the processed/indexed data to be SMALLER because it doesn't include all of the other contents of the PDF (images, formatting info, etc)
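One way to check where the space is actually going is to compare the sizes of the individual JSON files that persist() writes out (typically a document store and a vector store, with exact filenames varying by version); a minimal sketch:

```python
# Quick check (hypothetical helper, not from the original thread): list the
# sizes of the persisted JSON files to see which store dominates.
import os

persist_dir = "."  # same directory passed to storage_context.persist()

for name in sorted(os.listdir(persist_dir)):
    if name.endswith(".json"):
        size_mb = os.path.getsize(os.path.join(persist_dir, name)) / 1e6
        print(f"{name}: {size_mb:.1f} MB")
```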