Hello folks,
I would like to ask for some guidance.

What is it about?
- I am building a pipeline that lets users upload their PDFs (say, medical reports).
- Reports from different labs generally use different layouts for the test name, result, normal range, etc.
- I would like to extract the results table from each report and produce a structured table that can be used later.
- So the task is to convert unstructured content into structured data.
Here I am going with the assumption that I won't need OCR,
because the PDFs will contain only text.

I have tried...
Python libraries such as:
- camelot
- read_pdf
- tabula-py
etc.
But they either return more than just the required table (other information comes along with it)
or don't recognize that there is a table at all!
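
One workaround I could imagine for the "other information" problem is to filter the extracted lines and keep only those that look like a result row. Below is a minimal sketch, assuming each row sits on one line as test name, numeric result, optional unit, then a low-high range. The pattern and the function name `filter_result_rows` are my own illustrative choices, and real reports will likely need different rules:

```python
import re

# Hypothetical row shape: <test name> <result> [unit] <low>-<high>
# This is only a heuristic; lab reports vary a lot in practice.
ROW_PATTERN = re.compile(
    r"^(?P<test>[A-Za-z][A-Za-z0-9 ()/%-]*?)\s+"   # test name, starts with a letter
    r"(?P<result>\d+(?:\.\d+)?)\s+"                # numeric result
    r"(?:(?P<unit>[^\s\d][^\s]*)\s+)?"             # optional unit, e.g. g/dL
    r"(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)\s*$"  # normal range
)


def filter_result_rows(lines):
    """Return one dict per line that matches the assumed row shape."""
    rows = []
    for line in lines:
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


sample_lines = [
    "Patient Name: John Doe",
    "Hemoglobin 13.5 g/dL 12.0 - 16.0",
    "Report generated on 2023-01-01",
    "WBC Count 7200 4000-11000",
]
rows = filter_result_rows(sample_lines)
```

This keeps the two result lines and drops the header and footer lines, but it would miss rows with non-numeric results (for example "Negative") or ranges written as "< 5.0".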

I am asking for advice on...
Can we:
- Extract all the text from the PDF (including the table)
- Give it to GPT-3 or another LLM to create a structured table listing the medical tests?
Or is there another approach I should use that is more robust and accurate?
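
To make the LLM idea concrete, here is a rough sketch of what I have in mind. It assumes the report text has already been extracted from the PDF, and that `llm` is a hypothetical callable (prompt string in, reply string out) wrapping whichever model is used. The prompt wording, the column names, and all function names are my own assumptions, not any library's API:

```python
import json

# Assumed output columns; adjust to whatever the downstream table needs.
REQUIRED_KEYS = ("test", "result", "unit", "normal_range")

PROMPT_TEMPLATE = (
    "Extract every lab test from the report text below. "
    "Reply with only a JSON array of objects with the keys "
    '"test", "result", "unit" and "normal_range".\n\n'
    "Report text:\n{report_text}"
)


def build_prompt(report_text):
    """Wrap the extracted PDF text in the extraction instructions."""
    return PROMPT_TEMPLATE.format(report_text=report_text)


def parse_llm_reply(reply):
    """Turn the model's reply into row dicts, rejecting malformed output."""
    data = json.loads(reply)  # raises an error if the reply is not valid JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of rows")
    rows = []
    for item in data:
        if not isinstance(item, dict):
            raise ValueError("expected each row to be a JSON object")
        # Missing keys become None so every row has the same columns.
        rows.append({key: item.get(key) for key in REQUIRED_KEYS})
    return rows


def extract_tests(report_text, llm):
    """`llm` is a hypothetical callable: prompt string in, reply string out."""
    return parse_llm_reply(llm(build_prompt(report_text)))


def fake_llm(prompt):
    # Stand-in for a real model call, used only to exercise the code path.
    return json.dumps([
        {"test": "Hemoglobin", "result": "13.5",
         "unit": "g/dL", "normal_range": "12.0-16.0"},
        {"test": "WBC Count", "result": "7200"},
    ])


rows = extract_tests("Hemoglobin 13.5 g/dL 12.0 - 16.0", fake_llm)
```

Since the model can misread or invent values, I assume the parsed rows would still need to be checked against the original text before being trusted.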
Please help, thanks!