Updated 7 days ago

Llama parse does not follow parsing instruction

Hello, I am facing a problem with llama_parse.LlamaParse. It really does not follow the parsing_instruction. Where could I get some help?
You may have to set is_formatting_instruction=False

It also depends on the mode being used (for example, text mode probably won't use parsing instructions, I think)
Hey! Thanks for reaching out
I am really not an expert. My subroutine looks like this:
import pickle

# Import path for recent llama-index releases; older ones use `from llama_index import ...`
from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

def parse_pdf(pdf_file, output_file):
    parsing_instructions = """
The provided pdf document is a multi page scientific article.
The following instructions should be followed:
  • Include the journal title, journal name, and authors ONLY at the beginning of the document as they appear on the first page.
  • Exclude repeated occurrences of the journal title, journal name, and authors on subsequent pages (e.g., in headers or footers).
  • Preserve the logical flow of the document's main content without splitting paragraphs: focus on maintaining text continuity and readability.
  • Exclude non-essential elements such as: page titles, page number, headers and footers.
  • Do not return figures, tables, acknowledgments, funding information and references
  • Do not return any non-ASCII or control characters, publisher details, download information, copyright indications.
  • I repeat, do not return References or Bibliography sections, as they are not part of the main content."""
    parser = LlamaParse(result_type="markdown", parsing_instruction=parsing_instructions, language="en")

    # Note: this second assignment replaces the parser above,
    # so parsing_instruction is not passed to the parser that is actually used.
    parser = LlamaParse(result_type="markdown", verbose=True, language="en")

    md_data = None
    md_data = SimpleDirectoryReader(
        input_files=[pdf_file],
        required_exts=[".pdf"],
        encoding="utf-8",
        file_extractor={".pdf": parser}
    ).load_data()

    # Check if md_data is empty or None
    if not md_data:
        print(f"Error: No data returned when parsing the file '{pdf_file}'. Skipping this file.")
        return None

    # Proceed to save the data only if parsing was successful
    with open(output_file, 'wb') as f:
        pickle.dump(md_data, f)

    return md_data
Did you try setting it to False though? It's True by default.
I just tried both, and it always comes back with all the references... it's crazy, it's like the instructions are ignored.
@Sacha Bron or @pld any ideas? I feel like I've seen this a few times.
Hi @rooooray, can you send me your jobID so I can look at the logs?
Hey, here it is: Started parsing the file under job_id 295a9fee-1a79-461a-9e97-d5252b9c5983
did you find a solution?
hey not really, I have lots of papers to parse, but they all return with bibliography and a lot of things as if my instructions are ignored.
Ok, thanks for letting me know. I'm having similar issues.
Hey @Logan M and @Sacha Bron, is there anything we can do to try to fix this? Any ideas?
You need to try another mode. Fast Mode doesn't produce Markdown and thus will not apply your instructions.
Try at least Accurate mode.
To test, you can try a prompt like "translate everything into Spanish" so it's more obvious whether it works or not.
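Along the same lines, a quick check on the returned Markdown can show whether an instruction such as "do not return References" was honoured. This is only a rough sketch: `has_reference_section` is a hypothetical helper, and the heading pattern is an assumption about how the section is titled in the output.

```python
import re

# Assumed heading styles: "# References", "## Bibliography", or a bare "References" line.
_REF_HEADING = re.compile(
    r"^\s*(?:#{1,6}\s*)?(?:references|bibliography)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def has_reference_section(markdown_text: str) -> bool:
    """Return True if the text contains a line that looks like a references heading."""
    return _REF_HEADING.search(markdown_text) is not None
```

With the documents returned by `load_data()`, something like `any(has_reference_section(doc.text) for doc in md_data)` should tell you whether the section slipped through, assuming each document exposes its content as `.text`.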
This seems to make it work:

is_formatting_instruction=False,
verbose=True,
disable_ocr=True,
premium_mode=True,
fast_mode=False,
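Put together with the instructions, those flags might be passed like this. It's a minimal sketch, assuming the installed `llama_parse` version accepts all of these keyword arguments (they are the ones named in this thread) and that an API key is already configured; `build_parser_kwargs` and `make_parser` are hypothetical helper names.

```python
def build_parser_kwargs(parsing_instructions: str) -> dict:
    """Collect the LlamaParse keyword arguments suggested in this thread."""
    return {
        "result_type": "markdown",
        "parsing_instruction": parsing_instructions,
        "language": "en",
        "is_formatting_instruction": False,
        "verbose": True,
        "disable_ocr": True,
        "premium_mode": True,
        "fast_mode": False,
    }


def make_parser(parsing_instructions: str):
    """Build one parser that carries both the instructions and the flags."""
    # Imported lazily so the kwargs helper works without llama_parse installed.
    from llama_parse import LlamaParse

    return LlamaParse(**build_parser_kwargs(parsing_instructions))
```

Building the parser once, with the instructions and the flags in the same call, avoids ending up with a second parser that has no `parsing_instruction` at all.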