Updated 3 weeks ago

Parsing instructions

What arguments should I set on my LlamaParse object so that my parsing_instruction makes any difference?
Plain Text
from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key=LLAMAINDEX_API_KEY,
    parsing_instruction=parsing_instruction,
    premium_mode=True,
    split_by_page=False,
    verbose=False,
)
# Route PDF and Word files through LlamaParse.
file_extractor = {".pdf": parser, ".docx": parser, ".doc": parser}
documents = SimpleDirectoryReader(
    input_files=[file_path], file_extractor=file_extractor
).load_data()

# Join the text of every returned document into one string.
document_text = "\n".join(
    [doc.text for doc in documents if hasattr(doc, "text")]
)


It doesn't make any difference here. I think Llama hasn't implemented this feature yet, but they shipped it anyway 😄
35 comments
They work. You might need to set is_formatting_instruction=False though
Thanks for getting back to me. I'll try it right now.

Does the parsing_instruction work with images?

I've got financial data that includes images, and I'd like to try out my custom prompt on it.
Still doesn't work.
I tested with and without the argument and got the same result.
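One way to make the "same result" comparison concrete is to diff the text from the two runs. This is only a rough sketch using the standard library; `diff_outputs` and its argument names are made up for illustration, and the two strings are assumed to be the joined `doc.text` from a run without and a run with `parsing_instruction`.

```python
import difflib


def diff_outputs(without_instruction: str, with_instruction: str) -> list[str]:
    """Return unified-diff lines between two parse results (empty if identical)."""
    return list(
        difflib.unified_diff(
            without_instruction.splitlines(),
            with_instruction.splitlines(),
            fromfile="without_instruction",
            tofile="with_instruction",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Toy strings standing in for two real parse results.
    for line in diff_outputs("Intro\n![chart](img.png)", "Intro\n<IMAGE/>"):
        print(line)
```

An empty list means the two outputs are byte-for-byte identical, which would suggest the instruction is not being applied (or that a cached result is being returned).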
I use it pretty extensively, and it works for me. Maybe also check your llama-parse version: pip install -U llama-parse
Also, not sure what your instructions are, but you might need to get more creative
Plain Text
parsing_instruction = """
You are a highly proficient language model designed to convert pages from PDF, PPT and other files into structured markdown text. Your goal is to accurately transcribe text, represent formulas in LaTeX MathJax notation, and identify and describe images, particularly graphs and other graphical elements.

You have been tasked with creating a markdown copy of each page from the provided PDF or PPT image. Each image description must include a full description of the content, a summary of the graphical object.

Maintain the sequence of all the elements.

For the following element, follow the requirement of extraction:
for Text:
   - Extract all readable text from the page.
   - Exclude any diagonal text, headers, and footers.

for Text which includes hyperlink:
    -Extract hyperlink and present it with the text
    
for Formulas:
   - Identify and convert all formulas into LaTeX MathJax notation.

for Image Identification and Description:
- For each image and graph, REPLACE IMAGE WITH <IMAGE/> TAG!!!!!!!!!!!!!!!!!!!!!

    
# OUTPUT INSTRUCTIONS

- Ensure all formulas are in LaTeX MathJax notation.
- Exclude any diagonal text, headers, and footers from the output.
- For each image and graph, REPLACE IMAGE WITH <IMAGE/> TAG!!!!!!!!!!!!!!!!!!!!!
"""
I can't be more explicit than this :/
I will update to 0.5.13
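To confirm which llama-parse version is actually installed in the environment after upgrading, something like the following should work. It is a small sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "llama-parse"):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("llama-parse version:", installed_version())
```

If this prints `None`, the package is not installed in the interpreter that is running your script, which can happen when pip and the script use different environments.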
[Attachment: image.png]
I spent quite some time iterating on the prompt, and it still isn't working.
Does parsing_instruction only affect the textual content, or the pre-OCR stage as well?
It affects everything that happens after OCR

Passed it along to @pld / @Sacha Bron
Hey @pld @Sacha Bron, any updates? Thanks
Do you have the jobIDs of your requests?
Are you sure?
[Attachment: image.png]
This is from your web UI
@Sacha Bron the web UI works with custom parsing instructions
I was just giving options to try. I don't actually know what the issue is in your case

If you can share some jobIDs that aren't working as expected, @Sacha Bron can look into it
@Sacha Bron can you please reach out to me privately? I don't want to share the job ID publicly. Thanks
I cannot send you a Discord message nor add you as a friend because of your Discord settings. Can you DM me instead?
@Logan M @Sacha Bron @pld this is possibly a bug, you should look into it
Plain Text
parser = LlamaParse(
    api_key=LLAMAINDEX_API_KEY,
    parsing_instruction=parsing_instruction,
    premium_mode=True,
    is_formatting_instruction=True,
    invalidate_cache=True,
    do_not_cache=True,
    result_type="markdown",
)
This setup works.
Something is shadowing the parsing_instruction from the initial setup.
Either update the docs to include it, or open a PR with a bug fix 😄
"Something is shadowing the parsing_instruction from the initial setup" -- I don't know what that means 👀
@Lvka I just double-checked and is_formatting_instruction is True by default in the latest version of the llama_parse lib. Like on the UI.
@Lvka Ah, I found your issue. You need to use the Markdown result for your parsing instruction to show up in the output: result_type="markdown",
Cool! It would be helpful to include that required explicit argument in the docs
I agree with you. We'll probably refactor part of the lib soon™️
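For anyone landing here later, the working setup from this thread can be wrapped so that result_type="markdown" is never forgotten. This is a sketch, not official usage: the helper names `build_parser_kwargs` and `make_parser` are made up for illustration, while the keyword arguments are the ones reported above as working.

```python
def build_parser_kwargs(api_key: str, parsing_instruction: str) -> dict:
    """Collect the LlamaParse keyword arguments reported to work in this thread."""
    return {
        "api_key": api_key,
        "parsing_instruction": parsing_instruction,
        "premium_mode": True,
        # Per the thread, the instruction only shows up in the Markdown result.
        "result_type": "markdown",
        # Avoid being served a cached result from an earlier run.
        "invalidate_cache": True,
        "do_not_cache": True,
    }


def make_parser(api_key: str, parsing_instruction: str):
    """Build a LlamaParse instance from the arguments above."""
    # Imported lazily so build_parser_kwargs works without llama-parse installed.
    from llama_parse import LlamaParse

    return LlamaParse(**build_parser_kwargs(api_key, parsing_instruction))
```

Usage would be `parser = make_parser(LLAMAINDEX_API_KEY, parsing_instruction)`, and then the parser is passed to `file_extractor` exactly as in the original question.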