Updated 4 weeks ago

I'm looking for a faster way to load PDFs than pymupdf4llm.

I'm trying to find the "best" way to load PDFs. For now I've settled on pymupdf4llm; before that I was using the PdfMarkdownReader from the marker-py lib, but I was hoping to avoid insanely long indexing times.

I think SentenceSplitter requires torch, which makes sense, but I presume it would take a long while on any decent server if it only goes at 2.19it/s on my computer with a 24 GB 4090?
sentence splitters do not require torch?
oh, then I guess it's pymupdf4llm or something? or the pptx reader, actually
@Logan M is it possible to use PptxReader without torch?
it's killing our other devs' ability to use it; it takes so long to install that it's literally stopping them lol
the pptx reader has a hard dependency on transformers (it's using them to extract content)

I recommend writing your own reader?
I basically copied yours but used pptx to extract text and stuff, and then an LLM to see the page
I think it would be beneficial to offer that to users too, cause the only change required would be using a vision LLM to "see" what's on the page, vs using torch
still updating the pptx one to remove some deps
cause it was a custom one
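
For anyone landing here later, a rough sketch of what a torch-free pptx text extractor could look like. This is not the code from the thread and not the built-in PptxReader; it assumes only that a .pptx file is a zip archive whose slides live at ppt/slides/slideN.xml with text in DrawingML `<a:t>` runs. The function name `extract_slide_texts` is made up for illustration, and the vision-LLM step mentioned above is left out.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# DrawingML namespace that holds paragraphs (<a:p>) and text runs (<a:t>).
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml")


def extract_slide_texts(pptx_file) -> List[str]:
    """Return one string per slide using only the standard library.

    `pptx_file` is a path or a binary file-like object.
    Slides are ordered by the number in their file name, which usually
    (but not always) matches the order shown in the presentation.
    """
    texts = []
    with zipfile.ZipFile(pptx_file) as zf:
        # Keep only slide parts and sort them numerically (slide2 before slide10).
        slides = []
        for name in zf.namelist():
            match = SLIDE_RE.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))
        slides.sort()

        for _, name in slides:
            root = ET.fromstring(zf.read(name))
            lines = []
            # Each <a:p> is a paragraph; join its <a:t> runs into one line.
            for para in root.iter(A_NS + "p"):
                line = "".join(t.text or "" for t in para.iter(A_NS + "t")).strip()
                if line:
                    lines.append(line)
            texts.append("\n".join(lines))
    return texts
```

Each returned string could then be wrapped in whatever document object your pipeline expects; images, charts, and speaker notes are ignored in this sketch, which is where the vision LLM would come in.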