Based on the information provided in the knowledge sources, here are some suggestions for the context window limits you're hitting when using the summary index in your agent query.
The problem is most likely that the summary index retrieves all of its nodes for every query, so a large document set pushes a lot of content into each prompt. Here are some modifications you can try:
- Use a reranker:
Adding a reranker to your node postprocessors keeps only the most relevant nodes, which reduces how much content is passed into the context window. For example, you could use SentenceTransformerRerank:
from llama_index.core.postprocessor import SentenceTransformerRerank

# Keep only the top 3 nodes after cross-encoder reranking
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)

node_postprocessors = [rerank]
Then, apply this to your summary query engine:
summary_query_engine = summary_index.as_query_engine(
    llm=llm_4o_mini,
    node_postprocessors=node_postprocessors,
)
- Adjust the response mode:
For the summary index, you might want to explicitly set the response mode to "tree_summarize" and enable async processing:
summary_query_engine = summary_index.as_query_engine(
    llm=llm_4o_mini,
    response_mode="tree_summarize",
    use_async=True,
)
With tree_summarize, chunks are summarized recursively and those summaries are then summarized again, and use_async=True lets those LLM calls run concurrently, so large amounts of content are handled more efficiently without stuffing everything into a single prompt. A combined sketch of both options is shown below.
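If it helps, here is a minimal sketch that combines both suggestions and wraps the result as a tool for your agent. It assumes summary_index and llm_4o_mini come from your existing setup; the tool name and description are illustrative placeholders, not anything defined in your code:
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.tools import QueryEngineTool

# Rerank retrieved nodes and keep only the top 3 most relevant ones
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)

# Combine reranking with async tree summarization in one query engine
summary_query_engine = summary_index.as_query_engine(
    llm=llm_4o_mini,
    response_mode="tree_summarize",
    use_async=True,
    node_postprocessors=[rerank],
)

# Wrap the query engine as an agent tool; name and description are placeholders
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    name="summary_tool",
    description="Useful for summarization questions over the indexed documents.",
)
Because the reranker runs before the response synthesizer, only the top-ranked nodes reach the tree-summarize step, which keeps prompt sizes down even when the underlying index is large.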