I tried your suggestion but hit a snag: Unstructured doesn't seem to provide me with
category_depth
when processing PDFs so I can't infer the ListItem depth or header depth. Other than that though, it is straightforward:
s = ''
for elem in elements:
if elem.category == 'Title':
s += f'# {elem.text}\n\n'
if elem.category == 'NarrativeText' or elem.category == 'UncategorizedText':
s += f'{elem.text}\n\n'
if elem.category == 'ListItem':
s += f'* {elem.text}\n\n'
if elem.category == 'Table':
s += f'{elem.metadata.text_as_html}\n\n'
To fix the structure issue, I ran it through GPT 3.5 and then got back an almost perfect transcription of the original!
With a temperature of 0, my prompt is:
You are given some text extracted from a PDF, but the document structure got a bit mixed up. We seem to have lost any section and list structure there was, and some paragraphs have been split where they should not have been. The Table Of Contents looks messed up too, but maybe it will give some hint about the structure. Please Transcribe as much of the document as you can, with the proper structure in Markdown with HTML tables. DO NOT rephrase any text.