For 'nice' PDFs, the Python OCR library pytesseract works OK. For not-nice scans it can make mistakes. Google Cloud's Document AI has good off-the-shelf document parsers. There's also an option to hand-label documents (the 'custom extractor'), and if your labeling is good, it's pretty much perfect. Unfortunately, the GCP interface is annoying to navigate.
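
Here's a rough sketch of how the two could be combined: try pytesseract first, and only send a page to Document AI when the OCR confidence looks poor. Some assumptions on my part: pytesseract reads images rather than PDFs, so I use pdf2image (which needs poppler installed) to rasterize the pages; the function names and the cutoff of 60 are made up for illustration; and the project, location, and processor IDs are placeholders you'd take from your own GCP setup. Treat it as a starting point, not a tested pipeline.

```python
from statistics import mean


def mean_confidence(conf_values):
    """Average Tesseract's per-word confidences, ignoring the -1 entries
    it reports for boxes that contain no recognized text."""
    scores = [float(c) for c in conf_values if float(c) >= 0]
    return mean(scores) if scores else 0.0


def needs_fallback(confidence, threshold=60.0):
    """Hypothetical rule of thumb: below the threshold, try a stronger parser."""
    return confidence < threshold


def ocr_pdf_with_tesseract(pdf_path, dpi=300):
    """Return a list of (text, mean confidence) tuples, one per page."""
    # Imported here so the helpers above work without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path

    results = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        data = pytesseract.image_to_data(
            page, output_type=pytesseract.Output.DICT
        )
        text = pytesseract.image_to_string(page)
        results.append((text, mean_confidence(data["conf"])))
    return results


def ocr_pdf_with_document_ai(pdf_path, project_id, location, processor_id):
    """Send the whole PDF to an existing Document AI processor.

    project_id, location, and processor_id are placeholders for your own
    GCP values; the processor can be an off-the-shelf parser or a custom
    extractor you've trained.
    """
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    with open(pdf_path, "rb") as f:
        raw_document = documentai.RawDocument(
            content=f.read(), mime_type="application/pdf"
        )
    request = documentai.ProcessRequest(
        name=client.processor_path(project_id, location, processor_id),
        raw_document=raw_document,
    )
    return client.process_document(request=request).document.text
```

In practice you'd loop over the pages from `ocr_pdf_with_tesseract`, check `needs_fallback` on each confidence, and only call `ocr_pdf_with_document_ai` when something looks off, which keeps the GCP calls (and the time spent in its console) to a minimum.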