Updated 2 months ago

I'm looking for a faster way to load PDFs than pymupdf4llm.

At a glance

The community member is looking for the best way to load PDFs. They switched from the PdfMarkdownReader in the marker-py library, which had very long indexing times, to pymupdf4llm. They also suspect that SentenceSplitter is slow because it requires torch. A reply points out that sentence splitters do not require torch, and the torch requirement is traced to the PptxReader, which has a hard dependency on transformers for content extraction. Installing that dependency takes so long that it is blocking other developers. The recommended fix is to write a custom reader; the community member did so, extracting text with pptx and using a vision LLM to interpret each page.

I'm trying to find the "best" way to load PDFs. For now I've settled on pymupdf4llm: I was using the PdfMarkdownReader from the marker-py lib, but I wanted to avoid its insanely long indexing time.

I think SentenceSplitter requires torch, which makes sense, but I presume it would take a long while even on a decent server if it only runs at 2.19it/s on my computer with a 24 GB 4090?
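
On the torch question: as a reply in this thread notes, sentence splitters do not require torch. Rule-based sentence splitting is ordinary string processing. The sketch below is a minimal illustration of that idea in plain Python; it is not LlamaIndex's SentenceSplitter, the splitting rule is deliberately naive, and `split_sentences`, `chunk_sentences`, and `max_chars` are hypothetical names chosen for this example.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks.

    A chunk stays within max_chars unless a single sentence is longer
    than max_chars, in which case that sentence becomes its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Nothing here imports a model or a tensor library, so chunking of this kind should cost little next to parsing the PDF itself.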
10 comments
sentence splitters do not require torch?
oh, then I guess it's pymupdf4llm or something? or the pptx reader actually
@Logan M is it possible to use PptxReader without torch?
it's killing our other devs' ability to use it; it takes so long to install that it's literally stopping them lol
the pptx reader has hard dependencies on transformers (it's using them to extract content)

I recommend writing your own reader?
I basically copied yours but used pptx to extract text and stuff, and then an LLM to see the page
I think it would be beneficial to offer that to users too, cause the only change required would be using a vision LLM to "see" what's on the page, vs using torch
updating the pptx one still to remove some deps
cause it was a custom one
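
The custom-reader suggestion above can be sketched without torch or transformers. This is not the reader described in the thread, which used pptx to extract text and an LLM to view the page. It is a minimal sketch that swaps in the Python standard library, relying on the fact that a .pptx file is a zip archive whose slide XML stores text in DrawingML `a:t` elements. `read_pptx_text` is a hypothetical name, and the vision-LLM step is left out.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text paragraphs (a:p) and text runs (a:t).
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml")


def read_pptx_text(path) -> list[dict]:
    """Return one {"slide": n, "text": ...} record per slide.

    Slides are ordered by the number in their file name, which is an
    assumption: the true presentation order lives in presentation.xml.
    `path` may be a file path or a binary file-like object.
    """
    records = []
    with zipfile.ZipFile(path) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_RE.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))
        for number, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for para in root.iter(f"{A_NS}p"):
                # Join the text runs that make up one paragraph.
                line = "".join(t.text or "" for t in para.iter(f"{A_NS}t")).strip()
                if line:
                    paragraphs.append(line)
            records.append({"slide": number, "text": "\n".join(paragraphs)})
    return records
```

Each record could then be wrapped in whatever document type the indexing framework expects. Images and charts are ignored here, which is the gap the vision LLM mentioned above would fill.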