LlamaIndex

Updated 4 months ago

Parsing instructions

At a glance

The community member is having trouble getting their parsing_instruction to work with the LlamaParse object. They have tried various configurations, including setting is_formatting_instruction=False and updating the llama-parse library, but the issue persists. Other community members suggest that the parsing_instruction may only affect the textual context and not the pre-OCR processing, and that the result_type="markdown" argument may be required for the parsing instruction to be properly applied. The community members continue to investigate the issue and provide suggestions, but there is no explicitly marked answer.

What arguments should I set on my LlamaParse object so that my parsing_instruction makes a difference?

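    from llama_index.core import SimpleDirectoryReader
    from llama_parse import LlamaParse

    parser = LlamaParse(
        api_key=LLAMAINDEX_API_KEY,
        parsing_instruction=parsing_instruction,
        premium_mode=True,
        split_by_page=False,
        verbose=False
    )
    # Route every supported file type through the same LlamaParse instance.
    file_extractor = {".pdf": parser, ".docx": parser, ".doc": parser}
    documents = SimpleDirectoryReader(
        input_files=[file_path], file_extractor=file_extractor
    ).load_data()

    # Join the text of all returned documents into a single string.
    document_text = "\n".join(
        [doc.text for doc in documents if hasattr(doc, 'text')]
    )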

It doesn't make any difference here. I think LlamaIndex hasn't actually implemented this feature yet, but they shipped it anyway 😄

35 comments

They work. You might need to set is_formatting_instruction=False though

Thanks for getting back to me. I'll try it right now.

Does the parsing_instruction work with images?

I've got financial data that includes images, and I want to try my custom prompt on it.

Still doesn't work.

I tested with and without the arguments and got the same result.

I use it pretty extensively, and it works for me. Maybe also check your llama-parse version: pip install -U llama-parse
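Before upgrading, it can help to confirm which version is actually installed in the environment that runs the parser. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and `llama-parse` is the distribution name from the pip command above:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version found in the current environment, or None if not installed.
print("llama-parse:", installed_version("llama-parse"))
```

Running this inside the same virtual environment as the application rules out the case where pip upgraded a different interpreter's copy.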

Also, not sure what your instructions are, but you might need to get more creative

Plain Text

    parsing_instruction = """
You are a highly proficient language model designed to convert pages from PDF, PPT and other files into structured markdown text. Your goal is to accurately transcribe text, represent formulas in LaTeX MathJax notation, and identify and describe images, particularly graphs and other graphical elements.

You have been tasked with creating a markdown copy of each page from the provided PDF or PPT image. Each image description must include a full description of the content, a summary of the graphical object.

Maintain the sequence of all the elements.

For the following element, follow the requirement of extraction:
for Text:
   - Extract all readable text from the page.
   - Exclude any diagonal text, headers, and footers.

for Text which includes hyperlink:
    -Extract hyperlink and present it with the text
    
for Formulas:
   - Identify and convert all formulas into LaTeX MathJax notation.

for Image Identification and Description:
- For each image and graph, REPLACE IMAGE WITH <IMAGE/> TAG!!!!!!!!!!!!!!!!!!!!!

    
# OUTPUT INSTRUCTIONS

- Ensure all formulas are in LaTeX MathJax notation.
- Exclude any diagonal text, headers, and footers from the output.
- For each image and graph, REPLACE IMAGE WITH <IMAGE/> TAG!!!!!!!!!!!!!!!!!!!!!
"""

I can't be more explicit than this :/
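One way to tell whether an instruction like the one above took effect is to count the `<IMAGE/>` tags in the parsed text and compare against any image references left behind. A rough sketch, assuming that an ignored instruction leaves ordinary Markdown image syntax in the output (that is a guess, not documented behavior); the helper name `image_tag_report` and the sample text are made up for illustration:

```python
import re

# Markdown image syntax such as ![alt](path). Assumption: this is what remains
# in the output when the instruction has NOT been applied.
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def image_tag_report(markdown_text: str) -> dict:
    """Count <IMAGE/> tags and leftover Markdown images in parsed output."""
    return {
        "image_tags": markdown_text.count("<IMAGE/>"),
        "markdown_images": len(MARKDOWN_IMAGE.findall(markdown_text)),
    }


# Made-up sample: one image replaced by the tag, one left as a Markdown image.
sample = "Revenue grew.\n\n<IMAGE/>\n\n![chart](page_1_image_1.png)\n"
print(image_tag_report(sample))
```

Running the same report on output produced with and without `parsing_instruction` gives a quick, objective comparison instead of eyeballing two long documents.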

I'm on 0.5.11.

I will update to 0.5.13

Attachment

I spent quite some time iterating on the prompt, and it still isn't working.

Does parsing_instruction only affect the textual context, or the pre-OCR stage as well?

It affects everything that happens after OCR.

Passed it along to @pld / @Sacha Bron

Hey @pld @Sacha Bron, any updates? Thanks

Do you have the jobIDs of your requests?

Are you sure?

Attachment

This is from your web UI

@Sacha Bron web UI is working with custom parsing instructions

I was just giving options to try. I don't actually know what the issue is in your case

If you can share some jobIDs that aren't working as expected, @Sacha Bron can look into it

@Sacha Bron can you please reach out to me privately? I don't want to share the job ID publicly. Thanks

I cannot send you a Discord message nor add you as a friend because of your Discord settings. Can you dm me instead?

@Logan M @Sacha Bron @pld this is possibly a bug; you should look into it.

    parser = LlamaParse(
        api_key=LLAMAINDEX_API_KEY,
        parsing_instruction=parsing_instruction,
        premium_mode=True,
        is_formatting_instruction=True,
        invalidate_cache=True,
        do_not_cache=True,
        result_type="markdown"
    )

This setup works

Something is shadowing the parsing_instruction from the initial setup

Either update docs to include it

or

PR for bug fix 😄

"Something is shadowing the parsing_instruction from the initial setup" -- I don't know what that means 👀

@Lvka I just double-checked, and is_formatting_instruction is True by default in the latest version of the llama_parse lib, the same as in the UI.

@Lvka Ah, I found your issue. You need to use the Markdown result type for your parsing instruction to be applied: result_type="markdown"
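A minimal sketch of the configuration the thread converges on, using only constructor arguments that appear in the snippets above. The helper names `build_parser_kwargs` and `make_parser` are hypothetical and not part of llama-parse; the point is simply that `result_type="markdown"` is passed explicitly next to `parsing_instruction`:

```python
def build_parser_kwargs(parsing_instruction: str, api_key: str) -> dict:
    """Collect LlamaParse arguments (hypothetical helper, not part of llama-parse)."""
    return {
        "api_key": api_key,
        "parsing_instruction": parsing_instruction,
        "premium_mode": True,
        # Per the replies above, the instruction is only applied to Markdown output.
        "result_type": "markdown",
    }


def make_parser(parsing_instruction: str, api_key: str):
    """Build a LlamaParse instance from the collected arguments."""
    # Imported lazily so build_parser_kwargs works without llama-parse installed.
    from llama_parse import LlamaParse

    return LlamaParse(**build_parser_kwargs(parsing_instruction, api_key))
```

The parser returned by `make_parser` can then be passed in a `file_extractor` mapping to `SimpleDirectoryReader`, as in the original question.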

Cool! It would be helpful to mention that required explicit argument in the docs.

I agree with you. We'll probably refactor part of the lib soon™️
