Updated 2 months ago

Llama parse does not follow parsing instruction

At a glance

The community member is facing an issue with the llama_parse.LlamaParse library, where the parsing instructions are not being followed. They have provided their code and the parsing instructions they are using, but the output still includes elements that should be excluded, such as the references and bibliography sections.

Other community members have suggested trying different settings, such as setting is_formatting_instruction=False or using different modes like "Accurate" mode instead of "Fast" mode. One community member also suggested trying a prompt like "translate everything in Spanish" to see if the instructions are being followed.

The community member tried is_formatting_instruction with both values, but the output still included the references. The last reply in the thread reports a combination of settings (is_formatting_instruction=False, premium_mode=True and fast_mode=False, among others) that seems to make it work.

Hello, I am facing a problem with llama_parse.LlamaParse. It really does not follow the parsing_instruction. Where could I get some help?
18 comments
You may have to set is_formatting_instruction=False

It also depends on the mode being used (for example, text mode probably won't use parsing instructions, I think)
Hey! Thanks for reaching out.
I am really not an expert. My function looks like this:
# Imports are assumed (not shown in the original post); llama_index >= 0.10 layout.
import pickle

from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse


def parse_pdf(pdf_file, output_file):
    parsing_instructions = """
The provided pdf document is a multi page scientific article.
The following instructions should be followed:
  • Include the journal title, journal name, and authors ONLY at the beginning of the document as they appear on the first page.
  • Exclude repeated occurrences of the journal title, journal name, and authors on subsequent pages (e.g., in headers or footers).
  • Preserve the logical flow of the document's main content without splitting paragraphs: focus on maintaining text continuity and readability.
  • Exclude non-essential elements such as: page titles, page number, headers and footers.
  • Do not return figures, tables, acknowledgments, funding information and references
  • Do not return any non-ASCII or control characters, publisher details, download information, copyright indications.
  • I repeat, do not return References or Bibliography sections, as they are not part of the main content."""
    parser = LlamaParse(result_type="markdown", parsing_instruction=parsing_instructions, language="en")

    parser = LlamaParse(result_type="markdown", verbose=True, language="en")

    md_data = None
    md_data = SimpleDirectoryReader(
        input_files=[pdf_file],
        required_exts=[".pdf"],
        encoding="utf-8",
        file_extractor={".pdf": parser},
    ).load_data()

    # Check if md_data is empty or None
    if not md_data:
        print(f"Error: No data returned when parsing the file '{pdf_file}'. Skipping this file.")
        return None

    # Proceed to save the data only if parsing was successful
    with open(output_file, 'wb') as f:
        pickle.dump(md_data, f)

    return md_data
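One thing worth checking in the function above: parser is assigned twice, and the second LlamaParse(...) call has no parsing_instruction. In Python the second assignment simply rebinds the name, so the reader ends up using a parser that never received the instructions. Whether this fully explains the behaviour in this thread is an assumption, but it is easy to rule out. The sketch below uses a hypothetical StandInParser class (not part of llama_parse) so it runs without an API key; it only records the keyword arguments it was given.

```python
class StandInParser:
    """Hypothetical stand-in for LlamaParse; it only records its keyword arguments."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


instructions = "Do not return the References or Bibliography sections."

# Pattern from the post: the second assignment rebinds `parser`,
# so the object that carried parsing_instruction is discarded.
parser = StandInParser(result_type="markdown", parsing_instruction=instructions, language="en")
parser = StandInParser(result_type="markdown", verbose=True, language="en")
overwritten_has_instruction = "parsing_instruction" in parser.kwargs

# Single construction that carries every option at once.
merged_parser = StandInParser(
    result_type="markdown",
    parsing_instruction=instructions,
    verbose=True,
    language="en",
)
merged_has_instruction = "parsing_instruction" in merged_parser.kwargs

print(overwritten_has_instruction, merged_has_instruction)  # prints: False True
```

With the real library, the equivalent change would be to keep a single LlamaParse(...) call that includes parsing_instruction together with verbose and language.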
Did you try setting it to false though? It's true by default
I just tried both, it always comes back with all the references... it's crazy.. it's like the instructions are ignored
@Sacha Bron or @pld any ideas? I feel like I've seen this a few times.
Hi @rooooray, can you send me your jobID so I can look at the logs?
Hey, here it is: Started parsing the file under job_id 295a9fee-1a79-461a-9e97-d5252b9c5983
did you find a solution?
hey not really, I have lots of papers to parse, but they all return with bibliography and a lot of things as if my instructions are ignored.
Ok, thanks for letting me know. Having similar issues
hey @Logan M and @Sacha Bron anything we can do to try to fix this? Any ideas?
You need to try another mode. Fast Mode doesn't produce Markdown and thus will not apply your instructions.
Try at least in Accurate mode
To test, you can try a prompt like "translate everything in Spanish" so it's more obvious if it works or not
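For the original goal of dropping the references, a similar "obvious" check can be automated. This is a minimal sketch, assuming the parsed output is available as a markdown string and that the unwanted section starts with a line consisting only of "References" or "Bibliography" (optionally as a markdown heading). The helper name has_reference_section is hypothetical and not part of llama_parse.

```python
import re

# Matches a line that is only "References" or "Bibliography",
# optionally prefixed by markdown heading markers (#, ##, ...).
_REF_HEADING = re.compile(
    r"^[ \t]*#{0,6}[ \t]*(references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def has_reference_section(markdown_text: str) -> bool:
    """Return True if the text contains a standalone References/Bibliography heading."""
    return bool(_REF_HEADING.search(markdown_text))


sample_with_refs = "# Introduction\nSome text.\n\n## References\n[1] A. Author, 2020."
sample_without_refs = "# Introduction\nThis paragraph mentions references in passing."

print(has_reference_section(sample_with_refs))     # prints: True
print(has_reference_section(sample_without_refs))  # prints: False
```

Running this over each parsed document gives a quick signal of whether the instructions had any effect, without reading every output by hand.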
This seems to make it work:

is_formatting_instruction=False,
verbose=True,
disable_ocr=True,
premium_mode=True,
fast_mode=False,
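Put together with the options from the original post, the settings above might be passed in a single constructor call, roughly as sketched below. The keyword names are the ones quoted in this thread; whether your installed llama_parse version still accepts all of them is an assumption to verify. The helper names build_parser_kwargs and make_parser are hypothetical, and LlamaParse is imported lazily so the settings can be inspected without the package or an API key.

```python
def build_parser_kwargs(parsing_instruction: str) -> dict:
    """Collect the options mentioned in this thread into one dict (hypothetical helper)."""
    return {
        "result_type": "markdown",
        "language": "en",
        "parsing_instruction": parsing_instruction,
        "is_formatting_instruction": False,
        "verbose": True,
        "disable_ocr": True,
        "premium_mode": True,
        "fast_mode": False,
    }


def make_parser(parsing_instruction: str):
    """Build one LlamaParse instance from the combined settings (needs llama_parse installed)."""
    from llama_parse import LlamaParse  # imported here so the dict above works offline

    return LlamaParse(**build_parser_kwargs(parsing_instruction))


settings = build_parser_kwargs("Do not return the References or Bibliography sections.")
print(sorted(settings))
```

Keeping all options in one place also avoids accidentally creating a second parser that lacks parsing_instruction.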