Extracting plots from academic publications using LlamaParse


Hi folks, I'm trying to use LlamaParse to extract plots from academic publications. While the parser is able to extract obvious pictures, it cannot extract plots (see examples). I also added a prompt (see below) to improve the results, but have had no luck so far. Suggestions & insights are greatly appreciated. Thanks!

Prompt:
ins = """
You are a highly proficient language model designed to convert pages from PDF into structured markdown text. Your goal is to accurately transcribe text, identify and describe images, particularly graphs and other graphical elements.

You have been tasked with creating a markdown copy of each page from the provided PDF image. Each image description must include a full description of the content, a summary of the graphical object.

Maintain the sequence of all the elements.

For the following element, follow the requirement of extraction:
for Text:

Extract all readable text from the page.
Exclude any diagonal text, headers, and footers.

for Text which includes hyperlink:
-Extract hyperlink and present it with the text

for Image Identification and Description:

Identify all images, graphs, and other graphical elements on the page.
If an image contains wording that is hard to extract, flag it with <unidentifiable section> instead of parsing.
For each image, include a full description of the content in the alt text, followed by a brief summary of the graphical object.
If the image has a subtitle or caption, include it in the description.
If the image is an organisation chart, convert it into an understandable hierarchical format.
For a graph, extract the values in table form as a markdown representation.

OUTPUT INSTRUCTIONS

Exclude any diagonal text, headers, and footers from the output.
For each image and graph, provide a detailed description and summary.

"""


14 comments

Logan M

Are you using premium mode? Or what llama-parse settings are you using?

lucawang_nfls

In the GUI, I tried both accurate mode and premium mode. Accurate mode did not extract any plots, while premium mode only output a screenshot of every page as the images. I did not change any specific settings.
With the API via Python, I plugged in the prompt (as shown above) and used self.parser.get_json_result() to get JSON as the output format. The other settings are the defaults.
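
For anyone following along, here is a minimal sketch of how the JSON output could be checked for extracted images page by page. It assumes the result is a list of documents, each with a "pages" list whose entries carry "page" and "images" keys; that shape, the helper name summarize_images, and the sample data are assumptions for illustration, not something confirmed in this thread.

```python
def summarize_images(json_result):
    """Map each page number to the names of the images reported for it.

    Assumes a list of documents, each a dict with a "pages" list, where
    every page dict may carry "page" (number) and "images" (list of dicts).
    """
    summary = {}
    for doc in json_result:
        for page in doc.get("pages", []):
            # Fall back to a placeholder if an image entry has no name.
            names = [img.get("name", "<unnamed>") for img in page.get("images", [])]
            summary[page.get("page")] = names
    return summary


# Hypothetical sample in the assumed shape (not real parser output).
sample = [
    {
        "pages": [
            {"page": 1, "images": [{"name": "img_p0_1.png"}]},
            {"page": 2, "images": []},
        ]
    }
]

print(summarize_images(sample))
```

A page that maps to an empty list is one where no image was reported, which makes it easier to see which plots were missed.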

Logan M

premium mode also parses the screenshots though?

Logan M

It does more than output every page as images 👀

lucawang_nfls

nope, the premium mode just takes screenshots for each page and that's it.

lucawang_nfls

Is there any way I can improve the results? I did some research and it seems that the parser works well for raster images, but is likely to fail for vector graphics.

Logan M

If you share a file that you are running, I can likely check it out?

I'm nearly 100% certain premium takes screenshots AND parses those screenshots for content lol

Logan M

like, I can run premium mode right now and get text and markdown out of any document

lucawang_nfls

thanks!
Maybe I misunderstood. What I mean is that premium mode only provides page screenshots under the "image" tab; it does not extract individual images. When I said "premium mode does not parse screenshots", I meant that it does not extract the images themselves.
Of course, premium mode does parse text and tables as .md, just like the fast / accurate modes.
I'm happy with the markdown part; the problem is only with the images.

Logan M

ah ok, that clarifies a lot lol

I think the page-specific image extraction depends a lot on how the PDF is built. If it's not an embedded image in the PDF (which it looks like it's not?), it may not detect/extract it

Logan M

which indeed seems to be the case here
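
A rough way to check this yourself, as a sketch only: embedded raster images are stored in a PDF as image XObjects, whose dictionaries contain "/Subtype /Image", so scanning the raw bytes for that marker gives a quick hint. This is a heuristic and not how the parser decides anything; it misses inline images inside compressed content streams, and the function name and sample bytes below are made up for illustration.

```python
import re

# Image XObject dictionaries contain "/Subtype /Image" (whitespace optional).
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def count_embedded_images(pdf_bytes):
    """Count image XObject markers in raw PDF bytes (heuristic only)."""
    return len(IMAGE_XOBJECT.findall(pdf_bytes))


# Hand-written fragments standing in for real files (hypothetical data).
raster_like = (
    b"1 0 obj << /Type /XObject /Subtype /Image /Width 8 /Height 8 >> "
    b"stream ... endstream endobj"
)
vector_like = b"2 0 obj << /Length 20 >> stream 0 0 m 10 10 l S endstream endobj"

print(count_embedded_images(raster_like), count_embedded_images(vector_like))
```

On a real file you would pass the bytes read from the PDF opened in binary mode. A count of zero on a paper full of plots suggests the plots are drawn as vector graphics rather than embedded as images.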

Logan M

in the cases where the LLM is able to, I see it converting some graphs to tables
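
If it helps, those converted tables can be pulled out of the markdown output with something like the sketch below. It simply treats two or more consecutive lines starting with "|" as a table; the function name and the sample markdown are invented for illustration, and real output may need stricter handling.

```python
def find_markdown_tables(md):
    """Return runs of two or more consecutive lines that start with '|'."""
    tables, current = [], []
    for line in md.splitlines():
        stripped = line.strip()
        if stripped.startswith("|"):
            current.append(stripped)
        else:
            # A non-table line ends the current run; keep it if long enough.
            if len(current) >= 2:
                tables.append(current)
            current = []
    if len(current) >= 2:
        tables.append(current)
    return tables


# Hypothetical page markdown, not real parser output.
page_md = "Figure 1: accuracy vs. epoch\n\n| epoch | acc |\n|---|---|\n| 1 | 0.5 |\n\nSome text."

for table in find_markdown_tables(page_md):
    print("\n".join(table))
```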

lucawang_nfls

yup, some graphs (with numbers and text associated with them) are indeed parsed into tables. I've heard that some OCR tools can manage it (accurately extracting all images), but that's out of the scope of my project lol

lucawang_nfls

thanks for your help!
