Logan M needed some guidance help

At a glance

The community members are discussing a use case where they need to read questions from a PDF question paper and grade them using an LLM (likely a pydantic program). The main challenges identified are:

1. Getting the text from the PDF correctly, which could involve using a normal PDF loader or OCR if the PDF is not true-digital.

2. Parsing the PDF text to extract "Question" objects, as a normal PDF loader may just return the raw text.

The community members suggest that the PDF should be somewhat formatted to make the parsing easier. They also discuss the possibility of using a package like camelot to extract tables and images from the PDF.

rrini

needed some guidance/help.
My use case is that I am given a question paper, and for each question paper there's a corresponding marking scheme. I need to read the questions from the question paper PDF. The LLM shouldn't create its own questions. Same for the marking scheme. I feel it's a good use case for OpenAIPydanticProgram. What do you think?

8 comments

Logan M

Yea that sounds about right. There are two parts here -- getting the text off the PDF correctly, and then grading it with an LLM (likely a pydantic program)
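For the grading half, a pydantic program returns a structured object instead of free text. The sketch below shows that idea using only the standard library, so it is not the LlamaIndex API: the `Grade` fields, the JSON keys, and the `parse_grade` helper are all hypothetical names chosen for illustration, and the LLM call itself is left out.

```python
import json
from dataclasses import dataclass


@dataclass
class Grade:
    """Hypothetical structured result for one graded question."""
    question_number: int
    awarded_marks: float
    feedback: str


def parse_grade(reply: str, max_marks: float) -> Grade:
    """Parse an LLM reply (assumed to be a JSON object) into a Grade.

    Rejects marks outside the range allowed by the marking scheme, so a
    hallucinated score does not pass through silently.
    """
    data = json.loads(reply)
    grade = Grade(
        question_number=int(data["question_number"]),
        awarded_marks=float(data["awarded_marks"]),
        feedback=str(data.get("feedback", "")),
    )
    if not 0 <= grade.awarded_marks <= max_marks:
        raise ValueError(
            f"awarded_marks {grade.awarded_marks} is outside 0..{max_marks}"
        )
    return grade
```

With a real pydantic output class the same range check could probably live in a validator on the model, so the program would fail fast on an out-of-range score.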

rrini

How to achieve the "getting the text off the PDF correctly" part - same pydantic program right?
Or a normal query engine will do?

Logan M

I think just a normal PDF loader will work? Or if it's not a true-digital PDF, you may have to use OCR?
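One way to decide between the two options is to try plain text extraction first and fall back to OCR only when almost nothing comes out. This is a rough sketch, assuming the `pypdf` package is installed; the `needs_ocr` heuristic and its 20-character threshold are arbitrary assumptions, not anything LlamaIndex prescribes.

```python
from typing import List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: True when more than half the pages have almost no text.

    A scanned (image-only) PDF usually yields empty strings from a normal
    text extractor, which is the signal used here.
    """
    if not page_texts:
        return True
    sparse = sum(1 for t in page_texts if len(t.strip()) < min_chars_per_page)
    return sparse > len(page_texts) / 2


def read_pdf_pages(path: str) -> List[str]:
    """Return the extracted text of each page using pypdf."""
    # Imported lazily so needs_ocr() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

Typical use would be to call `read_pdf_pages("question_paper.pdf")` (a hypothetical file name) and send the file through an OCR tool only if `needs_ocr(...)` returns True.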

rrini

But a normal PDF loader wouldn't return "Question" objects right? It will just read the pdf text.

Logan M

Right -- I'm assuming the PDF is somewhat formatted though, so hopefully it's easy to just parse/split the text?
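If the paper follows a predictable layout, the parse/split step can be plain regex work. The sketch below assumes each question starts on its own line with something like `Q1.`, `Question 2.` or `3)`, and that marks appear as `[2 marks]` or `(5 marks)`; both conventions and the `Question` fields are assumptions to adapt to the real documents.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    number: int
    text: str
    max_marks: Optional[int] = None


# Assumed layout: a question begins a line with "Q1.", "Question 1." or "1)".
QUESTION_START = re.compile(
    r"^\s*(?:Q(?:uestion)?\s*)?(\d+)[.)]\s+", re.MULTILINE | re.IGNORECASE
)
# Assumed marks notation: "[2 marks]" or "(5 marks)".
MARKS = re.compile(r"[\[(](\d+)\s*marks?[\])]", re.IGNORECASE)


def split_questions(raw_text: str) -> List[Question]:
    """Split raw PDF text into Question objects, one per numbered item."""
    starts = list(QUESTION_START.finditer(raw_text))
    questions = []
    for i, match in enumerate(starts):
        # A question body runs until the next question start (or end of text).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(raw_text)
        body = raw_text[match.end():end]
        marks_match = MARKS.search(body)
        max_marks = int(marks_match.group(1)) if marks_match else None
        # Drop the marks annotation and collapse line breaks into spaces.
        text = " ".join(MARKS.sub("", body).split())
        questions.append(Question(int(match.group(1)), text, max_marks))
    return questions
```

Because the text comes straight from the PDF rather than from the model, the LLM never gets the chance to invent questions; it only sees what the splitter produced.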

rrini

Hmmm. I can work on formatting the PDF documents that way.
Also, does LlamaIndex have a loader for tables and diagrams?

Logan M

Hmm not really 🤔 You can use a package like camelot to try and get tables out of PDFs

For images, I think some PDF libraries can also spit out images
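For the tables part, a minimal sketch with camelot might look like the following. It assumes the `camelot-py` package is installed, that the PDF is text-based (camelot does not handle scanned pages), and that the first row of each table is a header; `rows_to_records` and `extract_tables` are hypothetical helper names.

```python
from typing import Dict, List


def rows_to_records(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Turn table rows into dicts, treating the first row as the header."""
    if not rows:
        return []
    header = [cell.strip() for cell in rows[0]]
    return [dict(zip(header, (cell.strip() for cell in row))) for row in rows[1:]]


def extract_tables(path: str, pages: str = "1-end") -> List[List[Dict[str, str]]]:
    """Read every table camelot can find and return each as a list of records."""
    # Imported lazily so rows_to_records() works without camelot installed.
    import camelot

    tables = camelot.read_pdf(path, pages=pages)
    # Each camelot table exposes its cells as a pandas DataFrame via .df.
    return [rows_to_records(table.df.values.tolist()) for table in tables]
```

A marking scheme laid out as a table (criterion in one column, marks in another) would come back as a list of dicts that is easy to pass on to the grading step.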

rrini

got it!
Thanks ❤️
