Parsing

I have been looking for ways to parse PDFs into titles, sections, paragraphs, and tables, as well as figure captions. My goal is to build a very robust indexer that maximizes the LLM's understanding of PDF snippets by using things like section titles as metadata, chunking at section boundaries, keeping tables together, recognizing column headers, and so on.

It seems that none of the PDF integrations for LlamaIndex do all of this; they all just rip the text out of the PDF without these structural elements. Unfortunately, I can't use third-party APIs for this, as the PDFs are sensitive company data.

I tried looking into the libraries behind the integrations, including PyMuPDF, pdfminer.six, and so on, and found that while they can do things like produce an HTML file with the right layout, they still don't recover the actual structure of the PDF in terms of titles, sections, paragraphs, and so on. LlamaIndex does have the unstructured reader integration, but its usage of the library is extremely simple: it runs the low-res auto partition and then just concatenates all the bits of text like the other readers, so again it's not what I need.

However, when I looked into the unstructured library itself, it is the most powerful one I have found so far. They seem to throw everything they can at the problem, including OCR and other computer-vision machine-learning models. What was lacking, however, was converting the list of elements they find into a structure that is easily consumed by a file reader or parser. For example, if I could convert to Markdown, there are parsers that will turn the headings into metadata.
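
Something like this downstream step is what I have in mind (just a sketch, assuming LlamaIndex's MarkdownNodeParser and a hypothetical converted.md file; I haven't checked the exact metadata keys it sets):

Python
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser

# converted.md is a hypothetical file holding the Markdown produced from the PDF
markdown_text = open("converted.md").read()

# MarkdownNodeParser splits the document at heading boundaries and records
# heading information in each node's metadata
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents([Document(text=markdown_text)])

for node in nodes:
    print(node.metadata, node.text[:80])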

It seems like this last step is fairly straightforward and I'm wondering if someone has already written some code that does it.

Finally, I am wondering if there is another library or integration that I missed that would do everything I want.
6 comments
You could convert the unstructured output to markdown yourself, I'm not sure it would be that complex πŸ€”
I tried your suggestion but hit a snag: Unstructured doesn't seem to provide category_depth when processing PDFs, so I can't infer the ListItem depth or header depth. Other than that, though, it is straightforward:

Python
from unstructured.partition.pdf import partition_pdf

# "document.pdf" is a placeholder path; the hi_res strategy gives the best layout detection
elements = partition_pdf("document.pdf", strategy="hi_res")

s = ''
for elem in elements:
    if elem.category == 'Title':
        s += f'# {elem.text}\n\n'  # no category_depth for PDFs, so every title becomes an H1
    elif elem.category in ('NarrativeText', 'UncategorizedText'):
        s += f'{elem.text}\n\n'
    elif elem.category == 'ListItem':
        s += f'* {elem.text}\n\n'  # flat list; the nesting depth is unknown
    elif elem.category == 'Table':
        s += f'{elem.metadata.text_as_html}\n\n'  # keep tables as HTML inside the Markdown


To fix the structure issue, I ran it through GPT-3.5 and then got back an almost perfect transcription of the original!

With a temperature of 0, my prompt is:
Plain Text
You are given some text extracted from a PDF, but the document structure got a bit mixed up.  We seem to have lost any section and list structure there was, and some paragraphs have been split where they should not have been.  The Table Of Contents looks messed up too, but maybe it will give some hint about the structure.  Please Transcribe as much of the document as you can, with the proper structure in Markdown with HTML tables.  DO NOT rephrase any text.
I haven't calculated whether this is cost-prohibitive, but at least you don't need a very smart LLM
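
For completeness, the cleanup call looks roughly like this (a sketch with the openai client; the model name and message layout are my assumptions, only the temperature and prompt come from above):

Python
from openai import OpenAI

client = OpenAI()

def restructure(markdown_text: str, prompt: str) -> str:
    """Send the roughly converted Markdown to GPT-3.5 with temperature 0."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": prompt},   # the restructuring prompt quoted above
            {"role": "user", "content": markdown_text},
        ],
    )
    return response.choices[0].message.content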
LLM Sherpa is really good. You can run your own Docker server.
It has an HTML converter as well.
So... this https://github.com/nlmatics/nlm-ingestor
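
In case anyone wants to try it, basic usage against a self-hosted container looks roughly like this (a sketch; the URL and port are the defaults I've seen in the nlm-ingestor docs, so double-check them):

Python
from llmsherpa.readers import LayoutPDFReader

# Points at a locally running nlm-ingestor Docker container
ingestor_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
reader = LayoutPDFReader(ingestor_url)

doc = reader.read_pdf("company_report.pdf")  # hypothetical local file

for section in doc.sections():
    print(section.title)                     # section structure is preserved
for chunk in doc.chunks():
    print(chunk.to_context_text())           # chunk text together with its section context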
Very interesting... I wish I knew more about them or could find a user community.