Some I've used in the past are LiLT, Donut, and LayoutLM
Basically, these models are trained to look at a document and answer questions about it
For example, LiLT and LayoutLM look at the question + document text + bounding boxes from an image (and optionally the image itself, for LayoutLMv2 and v3), and output the start/end indexes of the answer span. Very reliable, since it's not generating text, just selecting text from the input that answers the question
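Here's a minimal sketch of the extractive approach using the Hugging Face pipeline (linked below). It assumes the community checkpoint "impira/layoutlm-document-qa" and a placeholder file "invoice.png"; you'd also need pytesseract + Tesseract installed so the pipeline can OCR the words and boxes for you:

```python
from transformers import pipeline

# Extractive document QA: LayoutLM picks a span from the OCR'd text,
# so the answer is always grounded in words actually on the page
qa = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",  # community checkpoint, swap in your own
)

result = qa(image="invoice.png", question="What is the invoice total?")
print(result)  # list of dicts with the answer span plus a confidence score
```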
Donut is a little different. The image + question is the only input: Donut reads the text from the pixels itself (no separate OCR step) and generates an answer. A little easier to use, but also slightly less reliable in my experience
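Same pipeline, different checkpoint, for the generative approach. A sketch assuming the public "naver-clova-ix/donut-base-finetuned-docvqa" checkpoint and the same placeholder image; note there's no OCR dependency here since Donut works straight from the pixels:

```python
from transformers import pipeline

# Generative document QA: Donut writes the answer text itself,
# so there's no span score and the output isn't guaranteed to be in the document
qa = pipeline(
    "document-question-answering",
    model="naver-clova-ix/donut-base-finetuned-docvqa",
)

print(qa(image="invoice.png", question="What is the invoice total?"))
```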
Hugging Face has some easy-to-use wrappers for this (the `pipeline` used in the snippets above)
https://huggingface.co/tasks/document-question-answering