Some I've used in the past are LiLT, Donut, and LayoutLM
Basically, these models are trained to look at a document and answer questions about it
For example, LiLT and LayoutLM look at the question + document text + bounding boxes from an image (and optionally the image itself, for LayoutLMv2 and v3), and output the start/end indexes of the answer span. Very reliable, since it's not generating text, just selecting text from the input that answers the question
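Here's a minimal sketch of the extractive approach using the Hugging Face pipeline (linked below). It assumes the community checkpoint "impira/layoutlm-document-qa" and a placeholder file "invoice.png"; you'd also need pytesseract + Tesseract installed so the pipeline can OCR the words and boxes for you:

```python
from transformers import pipeline

# Extractive document QA: LayoutLM picks a span from the OCR'd text,
# so the answer is always grounded in words actually on the page
qa = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",  # community checkpoint, swap in your own
)

result = qa(image="invoice.png", question="What is the invoice total?")
print(result)  # list of dicts with the answer span plus a confidence score
```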
Donut is a little different. The image + question is the only input: Donut reads the text from the pixels itself (no separate OCR step) and generates an answer. A little easier to use, but also slightly less reliable in my experience
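Same pipeline, different checkpoint, for the generative approach. A sketch assuming the public "naver-clova-ix/donut-base-finetuned-docvqa" checkpoint and the same placeholder image; note there's no OCR dependency here since Donut works straight from the pixels:

```python
from transformers import pipeline

# Generative document QA: Donut writes the answer text itself,
# so there's no span score and the output isn't guaranteed to be in the document
qa = pipeline(
    "document-question-answering",
    model="naver-clova-ix/donut-base-finetuned-docvqa",
)

print(qa(image="invoice.png", question="What is the invoice total?"))
```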
Hugging Face has some easy-to-use wrappers for this (the `pipeline` used in the snippets above)
https://huggingface.co/tasks/document-question-answering