
Updated 6 months ago

GitHub - VikParuchuri/marker: Convert PD...

Flipping tables in PDFs! 😠 I tried https://github.com/VikParuchuri/marker to convert PDFs to markdown to improve parsing and chunking, but tables with 'funky' formatting such as merged cells (e.g. in budget documents) come out as incorrectly parsed markdown tables. Azure Document Intelligence works better, but I would like a local and/or open-source package instead.
14 comments
You could try unstructured. I have used it and it works fine for me.
Extracts table details correctly for me
yeah this is my table....
[Attachment: image.png]
It's a bit tricky.
LlamaParse is not open-source but offers 1K pages free per day. It will work best here IMO.
Okay, I'll have a look, thanks.
Azure Document Intelligence actually parses the table okay
I would like something local though.
I haven't found a solid local solution yet either...
I'm playing with a 'correcting' agent for cases where the document specs exist: https://github.com/VikParuchuri/marker/issues/204#issuecomment-2208290322
In my case, I'm parsing Australian government budget papers, and fortunately there is a PDF guide and an XLS template.
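A rough sketch of what such a 'correcting' check could look like, assuming the template tells you the expected column headers. It parses a markdown table and flags rows whose cell count does not match, which is a typical symptom of merged cells being dropped. The function names and the sample table are made up for illustration and are not taken from marker or the linked issue.

```python
def parse_markdown_table(text):
    """Split a pipe-delimited markdown table into rows of cell strings."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        if line.endswith("|"):
            line = line[:-1]
        cells = [c.strip() for c in line[1:].split("|")]
        # Skip the |---|---| separator row (cells made only of '-' and ':')
        if all(c.count("-") >= 3 and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    return rows


def check_against_template(rows, expected_headers):
    """Return a list of human-readable problems; an empty list means the table fits."""
    if not rows:
        return ["no table rows found"]
    problems = []
    if rows[0] != expected_headers:
        problems.append(f"header mismatch: {rows[0]}")
    for i, row in enumerate(rows[1:], start=1):
        if len(row) != len(expected_headers):
            problems.append(
                f"row {i}: expected {len(expected_headers)} cells, got {len(row)}"
            )
    return problems


# Hypothetical parser output where one row lost a cell
sample = """
| Program | 2023-24 | 2024-25 |
|---|---|---|
| Health | 10.5 | 11.0 |
| Education | 8.2 |
"""

expected = ["Program", "2023-24", "2024-25"]  # assumed to come from the template
problems = check_against_template(parse_markdown_table(sample), expected)
print(problems)
```

The flagged rows are the ones you would hand to the correcting step (or re-parse with another tool) instead of trusting the whole table.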
for tables, try:
  • using a vision model in unstructured
  • camelot
  • CascadeTabNet
  • tabula
  • Table Transformer by Microsoft
  • Form Recognizer
Let us know if any of these options solves your problem.
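For the camelot option, a minimal local sketch might look like the following. It assumes `camelot-py` is installed and that the PDF is text-based (camelot does not handle scanned pages); `budget.pdf` and both helper names are placeholders. It tries the `lattice` flavor first (tables with ruled lines) and falls back to `stream` (whitespace-separated tables). The `copy_text=["v"]` option asks camelot to copy the text of vertically spanning cells into each row they cover, which may or may not help with a given merged-cell layout.

```python
def rows_to_markdown(rows):
    """Render a list of rows (lists of strings) as a markdown table."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


def extract_tables(pdf_path, pages="1"):
    """Return each detected table as a list of rows, trying lattice then stream."""
    import camelot  # third-party: pip install camelot-py

    # Lattice mode relies on drawn cell borders; copy_text repeats the text
    # of vertically merged cells in every row they span.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice", copy_text=["v"])
    if len(tables) == 0:
        # No ruled tables found, so fall back to whitespace-based detection.
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor="stream")
    return [t.df.values.tolist() for t in tables]


# Usage (needs a real PDF on disk, so it is left commented out):
# for rows in extract_tables("budget.pdf", pages="1-3"):
#     print(rows_to_markdown(rows))
```

This treats the first extracted row as the header, which is an assumption; multi-row headers in budget tables would need extra handling.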