Llama parse does not follow parsing instruction

At a glance

The community member is facing an issue with llama_parse.LlamaParse: the parsing instructions are not being followed. They provided their code and the parsing instructions they use, but the output still includes elements that should be excluded, such as the references and bibliography sections.

Other community members have suggested trying different settings, such as setting is_formatting_instruction=False or using different modes like "Accurate" mode instead of "Fast" mode. One community member also suggested trying a prompt like "translate everything in Spanish" to see if the instructions are being followed.

The community member tried both values of is_formatting_instruction without success. After the suggestion to move away from Fast mode, they reported that a combination of settings (is_formatting_instruction=False, disable_ocr=True, premium_mode=True, fast_mode=False) seemed to make the instructions take effect.

rooooray

Hello, I am facing a problem with llama_parse.LlamaParse. It really does not follow the parsing_instruction. Where could I get some help?

18 comments

Logan M

You may have to set is_formatting_instruction=False

It also depends on the mode being used (for example, text mode probably won't use parsing instructions, I think)

rooooray

Hey! Thanks for reaching out

rooooray

I am really not an expert. My subroutine looks like this:

rooooray

# Imports assumed for a llama-index >= 0.10 package layout.
import pickle

from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse


def parse_pdf(pdf_file, output_file):
    parsing_instructions = """
The provided pdf document is a multi page scientific article.
The following instructions should be followed:

Include the journal title, journal name, and authors ONLY at the beginning of the document as they appear on the first page.
Exclude repeated occurrences of the journal title, journal name, and authors on subsequent pages (e.g., in headers or footers).
Preserve the logical flow of the document's main content without splitting paragraphs: focus on maintaining text continuity and readability.
Exclude non-essential elements such as: page titles, page number, headers and footers.
Do not return figures, tables, acknowledgments, funding information and references
Do not return any non-ASCII or control characters, publisher details, download information, copyright indications.
I repeat, do not return References or Bibliography sections, as they are not part of the main content."""

    # Build the parser once. The posted code assigned `parser` a second time
    # without parsing_instruction, which overwrote this instance and dropped
    # the instructions before any file was parsed.
    parser = LlamaParse(
        result_type="markdown",
        parsing_instruction=parsing_instructions,
        verbose=True,
        language="en",
    )

    md_data = SimpleDirectoryReader(
        input_files=[pdf_file],
        required_exts=[".pdf"],
        encoding="utf-8",
        file_extractor={".pdf": parser},
    ).load_data()

    # Check if md_data is empty or None
    if not md_data:
        print(f"Error: No data returned when parsing the file '{pdf_file}'. Skipping this file.")
        return None

    # Proceed to save the data only if parsing was successful
    with open(output_file, "wb") as f:
        pickle.dump(md_data, f)

    return md_data

rooooray

Following https://github.com/run-llama/llama_parse/blob/main/llama_parse/base.py, I set is_formatting_instruction to True.

Logan M

Did you try setting it to false though? It's true by default.

rooooray

I just tried both, and it always comes back with all the references... it's crazy... it's like the instructions are ignored.

Logan M

@Sacha Bron or @pld any ideas? I feel like I've seen this a few times.

Sacha Bron

Hi @rooooray, can you send me your jobID so I can look at the logs?

rooooray

Hey, here it is: Started parsing the file under job_id 295a9fee-1a79-461a-9e97-d5252b9c5983

EasyAI (Chris)

Did you find a solution?

rooooray

Hey, not really. I have lots of papers to parse, but they all come back with the bibliography and a lot of other things, as if my instructions were ignored.

EasyAI (Chris)

Ok, thanks for letting me know. I'm having similar issues.

rooooray

Hey @Logan M and @Sacha Bron, is there anything we can do to try to fix this? Any ideas?

Sacha Bron

You need to try another mode. Fast mode doesn't produce Markdown, so it will not apply your instructions.

Sacha Bron

Try at least in Accurate mode

Sacha Bron

To test, you can try a prompt like "translate everything in Spanish" so it's more obvious if it works or not
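The translation test suggested above can be checked automatically instead of by eye. This is only a rough sketch: the stopword lists are small, hand-picked assumptions, and looks_spanish is a hypothetical helper, not part of llama_parse.

```python
import re

# Small, hand-picked stopword lists (an assumption for illustration, not exhaustive).
SPANISH_HINTS = {"el", "la", "los", "las", "de", "que", "y", "en",
                 "un", "una", "es", "por", "con", "para", "del"}
ENGLISH_HINTS = {"the", "of", "and", "in", "to", "is", "that", "for",
                 "with", "as", "are", "this", "was", "by", "on"}


def looks_spanish(text: str) -> bool:
    """Return True when Spanish stopwords outnumber English ones in text."""
    words = re.findall(r"[a-záéíóúñü]+", text.lower())
    spanish = sum(w in SPANISH_HINTS for w in words)
    english = sum(w in ENGLISH_HINTS for w in words)
    return spanish > english
```

If the parser is given "translate everything in Spanish" as its instruction and looks_spanish still returns False for the parsed text (for example the text of the first returned document), the instruction was most likely not applied in that mode.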

rooooray

This seems to make it work:

is_formatting_instruction=False,
verbose=True,
disable_ocr=True,
premium_mode=True,
fast_mode=False,
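For anyone landing here later, the settings above might be combined with the original result_type and language arguments roughly as follows. This is a minimal sketch, assuming an installed llama_parse version that still accepts these keyword arguments and a LLAMA_CLOUD_API_KEY set in the environment; PARSER_KWARGS and build_parser are hypothetical names used only for illustration.

```python
# Keyword arguments reported to work in this thread, plus the
# result_type and language values from the original code.
PARSER_KWARGS = {
    "result_type": "markdown",
    "language": "en",
    "is_formatting_instruction": False,
    "verbose": True,
    "disable_ocr": True,
    "premium_mode": True,
    "fast_mode": False,
}


def build_parser(parsing_instruction: str):
    """Create a LlamaParse instance with the settings above (hypothetical helper)."""
    # Imported lazily so this module still loads where llama_parse is not installed.
    from llama_parse import LlamaParse

    return LlamaParse(parsing_instruction=parsing_instruction, **PARSER_KWARGS)
```

The resulting parser can be passed to SimpleDirectoryReader through file_extractor={".pdf": parser}, as in the subroutine posted earlier in the thread.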
