I'm looking for a faster way to load PDFs than pymupdf4llms.
At a glance
The community member is trying to find the best way to load PDFs and has settled on pymupdf4llms instead of the PdfMarkdownReader from the marker-py library, because the latter has very long indexing times. They are concerned that SentenceSplitters may also be slow, since they believe it requires torch. The comments suggest that sentence splitters may not actually require torch, and that pymupdf4llms or a pptx reader could be alternatives. However, the pptx reader depends on transformers, which is causing issues for other developers. The community members recommend writing a custom reader to avoid these dependencies.
I'm trying to find the "best" way to load PDFs. For now I've settled on pymupdf4llms: I had been using the PdfMarkdownReader from the marker-py lib, but I wanted to avoid its insanely long indexing times.
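For comparing loaders on indexing time, a small timing harness can help. This is only a sketch using the standard library: `time_loader` and `fake_loader` are made-up names for illustration, and a loader here is assumed to be any callable that takes a file path and returns text.

```python
import time
from typing import Callable, Iterable


def time_loader(load: Callable[[str], str], paths: Iterable[str]) -> dict:
    """Run `load` on every path and report total time and files per second."""
    paths = list(paths)
    start = time.perf_counter()
    chars = 0
    for path in paths:
        chars += len(load(path))  # count characters so the work is not skipped
    elapsed = time.perf_counter() - start
    return {
        "files": len(paths),
        "chars": chars,
        "seconds": elapsed,
        "files_per_second": len(paths) / elapsed if elapsed > 0 else float("inf"),
    }


def fake_loader(path: str) -> str:
    # Hypothetical stand-in for a real PDF loader; returns dummy text.
    return f"contents of {path}"


if __name__ == "__main__":
    print(time_loader(fake_loader, ["a.pdf", "b.pdf"]))
```

To time a real loader, pass it in place of `fake_loader`, for example `pymupdf4llm.to_markdown` (assuming that package is installed), and run each candidate over the same set of files.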
I think SentenceSplitters requires torch, which makes sense, but I presume it would take a long while even on a decent server, given that it runs at 2.19it/s on my computer with a 4090 (24 GB)?
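To put 2.19it/s in perspective, the arithmetic is just items divided by rate. The sketch below assumes a steady rate; the workload sizes are hypothetical, and what one "it" corresponds to depends on the progress bar being read.

```python
def estimated_seconds(items: int, items_per_second: float) -> float:
    """Time needed to process `items` at a steady rate."""
    if items_per_second <= 0:
        raise ValueError("rate must be positive")
    return items / items_per_second


RATE = 2.19  # it/s, the figure reported on the 4090

if __name__ == "__main__":
    for n in (100, 1_000, 10_000):  # hypothetical workload sizes
        secs = estimated_seconds(n, RATE)
        print(f"{n} items: {secs:.0f} s ({secs / 60:.1f} min)")
```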
I think it would be beneficial to offer that to users too, because the only change required would be using a vision LLM to "see" what's on the page instead of using torch.