Updated 2 months ago

Hello all,

I’m looking to expand past XML to PDFs, and the one big issue is the one everyone has: tables. Is there a recommended OSS way to read them? Specifically, something you’d recommend using with LlamaIndex?
15 comments
probably unstructured will be the best OSS solution
but overall tables are hard
marker is another OSS library that does ok-ish
Is OCR an acceptable solution?
OCR is really only half of the solution
Sure you can get the text -- but then you need to make sure it's formatted nicely
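To illustrate the "formatted nicely" half: a minimal sketch, assuming your OCR or PDF parser already hands you a table as a list of rows. The function name `rows_to_markdown` and the sample rows are made up for illustration and are not part of any library mentioned here. It just pads ragged rows and renders a Markdown table, which tends to be an easy format to pass along as text.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    # Pad short rows (common in OCR output) and escape literal pipes.
    cleaned = [
        [str(c).strip().replace("|", "\\|") for c in r] + [""] * (width - len(r))
        for r in rows
    ]
    header, *body = cleaned
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


# Hypothetical rows, roughly what a parser might return for a small table.
rows = [
    ["Item", "Hours", "Amount"],
    ["Regular pay", "80", "2,400.00"],
    ["Overtime", "5"],  # ragged row: the last cell was not detected
]
print(rows_to_markdown(rows))
```

The hard part is still getting correct rows out of the PDF in the first place; this only covers the step after that.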
Oh of course yeah
And then there’s the issue of hyperlinks
it really is the worst file format possible lol
and the most used
@isaackogan mind sharing the pdf file you're trying to read?
no sorry I’m testing with my employee pay statement 💀