

Large (400+% larger) storage size in relation to size of original dataset

I am working on a project where I will ultimately be processing thousands of PDF files (full-length ebooks) with the llama-index PDF reader and using their contents to augment the capabilities of a Mistral 7B chatbot.

I just ran an initial test where I processed about 50 files (~300 MB of data), and when I called index.storage_context.persist(persist_dir=".") the resulting JSON took up approximately 1.2 GB, roughly four times the size of the original dataset. Is this pretty typical, or is there something I'm doing wrong that is making it more bloated than it needs to be?
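
For reference, the overall flow looks roughly like this; it's a simplified sketch with default settings and a placeholder directory, not my exact script:

```python
# Simplified sketch of the indexing flow (defaults everywhere; the
# "./ebooks" path is a placeholder). Import paths are for llama-index
# v0.10+; older versions import from `llama_index` directly.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load the ~50 PDFs (~300 MB) with the default PDF reader
documents = SimpleDirectoryReader("./ebooks").load_data()

# Build the index with default chunking and embedding settings
index = VectorStoreIndex.from_documents(documents)

# Persist the docstore / index store / vector store as JSON files
index.storage_context.persist(persist_dir=".")
```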
(I'm happy to share more specific info on how I created the index if needed, but I'm just curious whether this dataset-to-index storage ratio is roughly typical when using llama-index.)

I just can't conceptually wrap my head around what would take up so much space if it's just chunking the text from the PDFs and storing a vector embedding of a few KB for each chunk.

If anything, because it's only extracting plaintext from the PDFs, I would actually expect the processed/indexed data to be SMALLER, since it doesn't include all of the other contents of the PDFs (images, formatting info, etc.).
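
To put a number on "a few KB each", this is the rough per-chunk arithmetic I had in mind (the 1536-dimension figure and the ~1024-token chunk size are assumptions on my part; other embedding models and chunking settings would differ):

```python
# Back-of-envelope behind "a few KB each" (assumptions: a 1536-dimension
# embedding model, ~1024-token chunks, ~4 characters per token of text).
dims = 1536
bytes_per_embedding = dims * 4   # 4-byte floats -> ~6 KB of vector per chunk
chunk_chars = 1024 * 4           # ~4 KB of plaintext per chunk

print(bytes_per_embedding, chunk_chars)  # ~6144 + ~4096 bytes per chunk
```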