
Updated 2 years ago

Hi! What library/libraries do you guys use to parse PDF files? Especially PDFs with mixed content (e.g. paragraphs, tables, charts, etc.)

When I use any of pypdf, pymupdf, pdfminer, etc., they all mess up the formatting, for example when the PDF contains tables. This results in bad input for the LLM, and hence bad output.
2 comments
An age-old question.
There's no perfect answer tbh. Unstructured.io tries to do this nicely, but I find it hit or miss.

The ideal approach is to detect tables, images, and text separately, then apply a parser specific to each element type.

I've also seen this, but haven't tried it yet https://facebookresearch.github.io/nougat/
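
To make the detect-then-dispatch idea from the comment above a bit more concrete, here's a minimal sketch. It assumes some layout-detection step (not shown, and not tied to any particular library) has already split a page into typed elements. The `Element` class, the `"text"`/`"table"`/`"image"` labels, and the `render_*` helpers are all made up for illustration. Each element type gets its own function, so tables reach the LLM as Markdown tables instead of scrambled text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Element:
    """One detected region of a page (hypothetical structure)."""
    kind: str      # assumed labels: "text", "table", or "image"
    content: Any   # str for text/image caption, list of rows for a table


def render_text(content: str) -> str:
    # Collapse hard line breaks and repeated spaces left by PDF extraction.
    return " ".join(content.split())


def render_table(rows: list[list[str]]) -> str:
    # Emit a Markdown table; the first row is treated as the header.
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def render_image(caption: str) -> str:
    # Charts can't be inlined as text, so keep a placeholder with the caption.
    return f"[Figure: {caption}]"


# Dispatch table: one parser/renderer per element type.
RENDERERS = {"text": render_text, "table": render_table, "image": render_image}


def to_llm_input(elements: list[Element]) -> str:
    # Render each element with its type-specific function, in reading order.
    return "\n\n".join(RENDERERS[e.kind](e.content) for e in elements)


page = [
    Element("text", "Quarterly   revenue\nby region"),
    Element("table", [["Region", "Revenue"], ["EU", "10"], ["US", "12"]]),
    Element("image", "Revenue trend chart"),
]
print(to_llm_input(page))
```

The hard part in practice is the detection step that produces the elements. This sketch only shows what to do once you have them.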