I'm looking for a faster way to load PDFs than pymupdf4llm.

At a glance

The community member is trying to find the best way to load PDFs and has settled on pymupdf4llm instead of the PdfMarkdownReader from the marker-py library, because the latter has very long indexing times. They are concerned that SentenceSplitter may also be slow because they believe it requires torch. The comments point out that sentence splitters do not require torch. The discussion then turns to the PptxReader, which has hard dependencies on transformers; the resulting long install time is blocking other developers. The recommendation is to write a custom reader to avoid these dependencies.

ZachHandley

I'm trying to find the "best" way to load PDFs, and for now I've settled on pymupdf4llm. I was using the PdfMarkdownReader from the marker-py lib, but I was hoping to avoid the insanely long indexing time

I think SentenceSplitter requires torch, which makes sense, but I presume it would take a long while on any decent server if it only goes at 2.19 it/s on my computer with a 24 GB 4090?

10 comments

Logan M

sentence splitters do not require torch?
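To illustrate why sentence splitting does not need torch, here is a minimal standard-library sketch. It is not LlamaIndex's implementation: the function names `split_sentences` and `chunk_sentences` and the `max_chars` parameter are hypothetical, and the regex is a naive heuristic that will mis-split abbreviations such as "e.g.".

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on '.', '!' or '?' followed by whitespace (a naive heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of up to max_chars characters.

    A single sentence longer than max_chars is kept whole, so such a chunk
    can exceed the limit.
    """
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Everything here is string work on the CPU, so a slow rate like the one quoted above more likely comes from another stage of the pipeline (for example a model-based PDF parser or embedding step) than from the splitting itself.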

ZachHandley

oh, then I guess it's pymupdf4llm or something? or the pptx reader actually

ZachHandley

@Logan M is it possible to use PptxReader without torch?

ZachHandley

it's killing our other devs' ability to use it, as it takes so long to install that it's literally stopping them lol

Logan M

the pptx reader has hard dependencies on transformers (it's using them to extract content)

I recommend writing your own reader?
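As a rough idea of what a dependency-free custom reader could look like, the sketch below pulls slide text out of a .pptx using only the standard library. It relies on a .pptx file being a zip archive whose slides live at `ppt/slides/slideN.xml`, with text runs stored in DrawingML `a:t` elements. The function name `read_pptx_text` is hypothetical, and this is not the existing PptxReader: it ignores images, notes, and tables of embedded objects, and it orders slides by file number, which usually but not always matches presentation order.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml")


def read_pptx_text(path) -> list[str]:
    """Return one string per slide containing that slide's text runs.

    `path` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        # Collect (slide number, archive member name) pairs for slide parts only.
        slides = []
        for name in archive.namelist():
            match = SLIDE_RE.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))

        texts = []
        for _, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            runs = [el.text for el in root.iter(f"{A_NS}t") if el.text]
            texts.append("\n".join(runs))
    return texts
```

Each returned string could then be wrapped in whatever document object the indexing pipeline expects, with slide images handled separately if they matter.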

ZachHandley

I basically copied yours but used pptx to extract text and stuff, and then an LLM to see the page

ZachHandley

I think it would be beneficial to offer that to users too, cause the only change required would be using a vision LLM to "see" what's on the page, vs using torch

ZachHandley

still updating the pptx one to remove some deps

ZachHandley

cause it was a custom one
