Parsing

I have been looking for ways to parse PDFs into titles, sections, paragraphs, and tables, as well as figure captions. My goal is to have a very robust indexer that maximizes the LLM's understanding of PDF snippets by using things like section titles as metadata, chunking at section boundaries, keeping tables together, recognizing column headers, and so on.
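
Concretely, the kind of node I would like to end up with looks something like this (just a sketch using LlamaIndex's TextNode; the section_title key is my own made-up metadata field):

Python
from llama_index.core.schema import TextNode

# one node per section, with the section title carried along as metadata
node = TextNode(
    text='Quarterly revenue grew in every region, driven mainly by...',
    metadata={'section_title': '2.1 Results'},  # illustrative, made-up metadata key
)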

It seems that none of the PDF integrations for LlamaIndex do all this; they all just rip the text out of the PDF without these structural elements. Unfortunately, I can't use third-party APIs for this because the PDFs are sensitive company data.

I tried looking into the libraries behind the integrations, including PyMuPDF, pdfminer.six, and so on, and found that while they can do things like produce an HTML file with the right layout, they still don't recover the actual structure of the PDF in terms of titles, sections, paragraphs, and so on. LlamaIndex does have the unstructured reader integration, but its usage is extremely simple: it runs the low-res auto partition and then just concatenates all of the bits of text like the other readers, so again it's not what I need.

However, when I looked into the unstructured library itself, it is the most powerful one I have found so far. They seem to throw everything they can at the problem, including OCR and other computer-vision machine-learning models. What was lacking, however, was a way to convert the list of elements they find into a structure that is easily consumed by a file reader or parser. For example, if I could convert to markdown, there are parsers that will turn the headings into metadata.
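
For instance, I think LlamaIndex's MarkdownNodeParser would split such markdown at heading boundaries and attach heading information to the node metadata (a rough sketch; exactly what lands in the metadata may depend on the version):

Python
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser

# markdown_text stands in for the markdown produced from unstructured's elements
markdown_text = '# Introduction\n\nSome paragraph...\n\n## Background\n\nMore text...'
nodes = MarkdownNodeParser().get_nodes_from_documents([Document(text=markdown_text)])
for node in nodes:
    print(node.metadata, node.text[:40])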

It seems like this last step is fairly straightforward and I'm wondering if someone has already written some code that does it.

Finally, I am wondering whether there is another library or integration that I missed that would do everything I want.
6 comments
You could convert the unstructured output to markdown yourself, I'm not sure it would be that complex πŸ€”
I tried your suggestion but hit a snag: unstructured doesn't seem to provide category_depth when processing PDFs, so I can't infer the ListItem depth or heading depth. Other than that, though, it is straightforward:

Python
from unstructured.partition.pdf import partition_pdf

# Assuming elements come from partition_pdf; the hi_res strategy plus
# infer_table_structure are needed so Table elements carry text_as_html.
elements = partition_pdf(filename='document.pdf', strategy='hi_res', infer_table_structure=True)

s = ''
for elem in elements:
    if elem.category == 'Title':
        s += f'# {elem.text}\n\n'
    elif elem.category in ('NarrativeText', 'UncategorizedText'):
        s += f'{elem.text}\n\n'
    elif elem.category == 'ListItem':
        s += f'* {elem.text}\n\n'
    elif elem.category == 'Table':
        s += f'{elem.metadata.text_as_html}\n\n'

To fix the structure issue, I ran it through GPT-3.5 and got back an almost perfect transcription of the original!

With a temperature of 0, my prompt is:
Plain Text
You are given some text extracted from a PDF, but the document structure got a bit mixed up.  We seem to have lost any section and list structure there was, and some paragraphs have been split where they should not have been.  The Table Of Contents looks messed up too, but maybe it will give some hint about the structure.  Please Transcribe as much of the document as you can, with the proper structure in Markdown with HTML tables.  DO NOT rephrase any text.
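
The call itself is just a plain chat completion (a sketch with the current OpenAI Python client; PROMPT is the prompt above and s is the markdown string built in the loop earlier):

Python
from openai import OpenAI

PROMPT = 'You are given some text extracted from a PDF, ...'  # the full prompt quoted above

client = OpenAI()
response = client.chat.completions.create(
    model='gpt-3.5-turbo',
    temperature=0,
    messages=[
        {'role': 'system', 'content': PROMPT},
        {'role': 'user', 'content': s},  # s is the markdown string built in the loop above
    ],
)
restructured_markdown = response.choices[0].message.content
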
I haven't calculated whether this is cost-prohibitive, but at least you don't need a very smart LLM.
LLM Sherpa is really good. You can run your own Docker server.
It has an HTML converter as well.
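Rough usage against a locally running ingestor container looks something like this (a sketch; the port and URL are what I recall from their README, so double-check them):

Python
from llmsherpa.readers import LayoutPDFReader

# assumes the nlm-ingestor Docker container is running locally (see their README for the exact port mapping)
llmsherpa_api_url = 'http://localhost:5010/api/parseDocument?renderFormat=all'
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf('document.pdf')

# section-aware chunks, with heading context attached
for chunk in doc.chunks():
    print(chunk.to_context_text())

# sections (and their tables) can also be rendered as HTML
for section in doc.sections():
    print(section.to_html(include_children=True, recurse=True))
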
So... this https://github.com/nlmatics/nlm-ingestor
Very interesting... I wish I knew more about them or could find a user community.