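# Imports assume llama-parse and llama-index >= 0.10
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    api_key=LLAMAINDEX_API_KEY,
    parsing_instruction=parsing_instruction,
    premium_mode=True,
    split_by_page=False,
    verbose=False,
)

# Route PDF and Word files through LlamaParse
file_extractor = {".pdf": parser, ".docx": parser, ".doc": parser}
documents = SimpleDirectoryReader(
    input_files=[file_path],
    file_extractor=file_extractor,
).load_data()

# Join the text of every returned document into one string
document_text = "\n".join(
    [doc.text for doc in documents if hasattr(doc, 'text')]
)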
pip install -U llama-parse
parsing_instruction = """
You are a highly proficient language model designed to convert pages from PDF, PPT and other files into structured markdown text. Your goal is to accurately transcribe text, represent formulas in LaTeX MathJax notation, and identify and describe images, particularly graphs and other graphical elements.

You have been tasked with creating a markdown copy of each page from the provided PDF or PPT image. Each image description must include a full description of the content, a summary of the graphical object.

Maintain the sequence of all the elements.

For the following element, follow the requirement of extraction:
for Text:
- Extract all readable text from the page.
- Exclude any diagonal text, headers, and footers.

for Text which includes hyperlink:
-Extract hyperlink and present it with the text

for Formulas:
- Identify and convert all formulas into LaTeX MathJax notation.

for Image Identification and Description:
- For each image and graph, REPLACE IMAGE WITH <IMAGE/> TAG!!!!!!!!!!!!!!!!!!!!!

# OUTPUT INSTRUCTIONS
- Ensure all formulas are in LaTeX MathJax notation.
- Exclude any diagonal text, headers, and footers from the output.
- For each image and graph, REPLACE IMAGE WITH <IMAGE/> TAG!!!!!!!!!!!!!!!!!!!!!
"""
parser = LlamaParse(
    api_key=LLAMAINDEX_API_KEY,
    parsing_instruction=parsing_instruction,
    premium_mode=True,
    is_formatting_instruction=True,
    invalidate_cache=True,
    do_not_cache=True,
    result_type="markdown",
)
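One way to tell whether the parsing instruction is being applied at all is to inspect the joined output (document_text in the first snippet) for the `<IMAGE/>` tag the instruction asks for. The following is a minimal sketch, not part of llama_parse: the helper name `instruction_report` and the sample string are hypothetical, and it assumes that images the parser left alone show up as ordinary markdown image syntax.

```python
import re

IMAGE_TAG = "<IMAGE/>"
# Markdown image syntax such as ![alt](path); assumed to be what an
# unreplaced image looks like in the parser output.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def instruction_report(document_text: str) -> dict:
    """Count <IMAGE/> tags and leftover markdown images in parsed text."""
    return {
        "image_tags": document_text.count(IMAGE_TAG),
        "leftover_markdown_images": len(MD_IMAGE.findall(document_text)),
    }


# Hypothetical sample standing in for real parser output.
sample = "Intro\n<IMAGE/>\nMore text ![chart](img_1.png)\n<IMAGE/>"
report = instruction_report(sample)
print(report)
```

If `image_tags` stays at zero for a document that clearly contains figures, the instruction is probably not reaching the parser, which would be consistent with something overriding it.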
Something is shadowing the parsing_instruction from the initial setup
-- I don't know what that means.
is_formatting_instruction is True by default in the latest version of the llama_parse lib. Like on the UI.
result_type="markdown",