Table extraction

jackson hole

Hello folks 👋
I would like to ask for a little guidance.

👉 What is it about?
————————————
→ I am building a pipeline that lets users upload their PDFs (say, medical reports).
→ Reports from different labs generally lay out the test name, result, normal range, etc. differently.
→ I would like to extract the table from each report and turn it into a structured table that can be used later.
→ So the task is to convert unstructured content into structured content.

Here I am going with the assumption that I won't need OCR,
because the PDFs will only contain text.

👉 I have tried...
——————————
Python libraries such as:
→ camelot
→ read_pdf
→ tabula-py
etc.

But they either return more than just the required table (other information comes along with it) or don't recognize that there is a table at all!

👉 I am asking for advice on...
——————————————————
Can we:

→ Extract all the text from the PDF (including the table), and then
→ Give it to GPT-3 or any other LLM and have it build a structured table of the medical tests?

Or is there another, more robust and accurate approach that I should be using?

Please help, thanks 🙏
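
For reference, the two-step idea above (extract all the text, then hand it to an LLM) could be sketched roughly as below. This is only a sketch under assumptions: the field names in `FIELDS` are made up for illustration, the text is assumed to be already extracted from the PDF, and the actual LLM call is left out because it depends on the provider. The last helper flags rows whose result does not appear verbatim in the source text, which is one cheap guard against the model altering numbers.

```python
import json

# Hypothetical field names for illustration; not taken from any real lab format.
FIELDS = ("test_name", "result", "unit", "reference_range")


def build_prompt(report_text):
    """Wrap the raw PDF text in instructions asking for a JSON array of rows."""
    return (
        "Below is the text of a lab report. Extract every test result as a "
        "JSON array of objects with exactly these keys: "
        + ", ".join(FIELDS)
        + ". Copy values verbatim and use null for anything not present. "
        "Return only the JSON.\n\n"
        "REPORT:\n" + report_text
    )


def parse_rows(llm_reply):
    """Parse the model's reply and keep only rows with exactly the expected keys."""
    data = json.loads(llm_reply)  # raises ValueError if the reply is not valid JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    return [
        item for item in data
        if isinstance(item, dict) and set(item) == set(FIELDS)
    ]


def unverified(rows, report_text):
    """Return rows whose result string does not appear verbatim in the source text."""
    return [row for row in rows if str(row["result"]) not in report_text]
```

The prompt from `build_prompt` would be sent to whichever model is used, and the reply passed to `parse_rows`. Anything returned by `unverified` probably deserves a manual look.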

4 comments

Logan M

Have you tried unstructured.io or deepdoctection?

Extracting tables is a tough problem tbh

samuel

what's also interesting regarding the table problem is that LlamaIndex seems to require some improvements for tables even from markdown documents (.md).

For example, if you have a directory full of markdown files and you use tables (|header|value|boolean|), the whole table gets ignored when you attempt to parse it (e.g. ObsidianReader from llama_index). In the end I had to do the table parsing manually using beautifulsoup and plain old regex 🙂
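
A manual parse of pipe tables like the one described above might look roughly like this. It is a minimal sketch using only the standard library (regex and string splitting, no beautifulsoup), not the code actually used in this thread; the function names are made up, and escaped pipes or empty edge cells are not handled.

```python
import re

# A separator row looks like |---|:---:|---| (dashes with optional colons).
SEPARATOR = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def split_row(line):
    """Split '| a | b |' into ['a', 'b'], dropping the outer pipes."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_tables(text):
    """Return a list of tables; each table is a list of dicts keyed by header."""
    tables = []
    lines = text.splitlines()
    i = 0
    while i < len(lines) - 1:
        header = split_row(lines[i])
        # A table starts with a header row followed by a separator row
        # that has the same number of cells.
        is_table_start = (
            "|" in lines[i]
            and SEPARATOR.match(lines[i + 1])
            and len(split_row(lines[i + 1])) == len(header)
        )
        if not is_table_start:
            i += 1
            continue
        rows = []
        i += 2
        # Consume body rows until a line without a pipe ends the table.
        while i < len(lines) and "|" in lines[i]:
            rows.append(dict(zip(header, split_row(lines[i]))))
            i += 1
        tables.append(rows)
    return tables
```

Each row comes back as a dict keyed by the header cells, so the table can be re-serialized as text or JSON before indexing.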

jackson hole

Thanks for the suggestions @Logan M 🤗

unstructured.io is an amazing library with a lot of potential... but I couldn't figure out how to keep the structure of the table text after it is extracted.

Yes, indeed. It is a tough problem... and even worse when we don't know in advance what structure to expect (since the layout of lab reports differs from lab to lab).

👉 I have even tried OCR with the Layout-Parser library. But there, many numbers are misread by the model, e.g. 17.4% becomes 174% 😃

Any idea how I should move forward?
Thanks 🙏
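
One cheap guard against OCR misreads like the 17.4% to 174% case above is a plausibility check on the extracted numbers. The sketch below only flags percentages above a threshold for manual review; the 100 cutoff is an assumption (some lab values reported as a percentage can legitimately exceed it), and it will not catch a misread that still lands in a plausible range.

```python
import re

# A number (optionally with a decimal part) followed by a percent sign.
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def suspicious_percentages(text, upper=100.0):
    """Return the percentage strings in `text` whose value exceeds `upper`."""
    flagged = []
    for match in PERCENT.finditer(text):
        if float(match.group(1)) > upper:
            flagged.append(match.group(0))
    return flagged
```

A stricter variant could compare each value against the reference range printed on the same row, when the report includes one.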

Logan M

Yea that's a common problem with OCR, usually the text is never perfect... looking at the github issues, I think unstructured.io is still working on improving table extraction

You could try another deep learning approach here: https://github.com/deepdoctection/deepdoctection

If none of these are working well, you might have to annotate some data and train a model 😅 Seems like there's no easy way here for tables
