Updated 3 months ago

I have a markdown file that contains text and tables. Can you suggest how I can ensure that a table is not chunked in the middle? I want each table to stay intact.
Please stand by @ihshan :) as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
LlamaIndex's MarkdownElementNodeParser class parses markdown files and handles tables in a specific way. It checks whether every row of a markdown table has the same number of columns and whether the table has at least two rows. A table whose rows all have the same number of columns is considered a "perfect table" and is converted to a dataframe. If the column counts differ, it's considered a "non-perfect table", and the raw text is stored instead of converting it to a dataframe. A table with fewer than two rows is not kept as a table.

Here's a brief overview of how it works:

  1. The parser reads each line of the markdown file.
  2. If a line starts with "|", it's treated as part of a table.
  3. The parser checks that every row of the table has the same number of columns and that the table has at least two rows.
  4. If every row has the same number of columns, the table is converted to a dataframe.
  5. If the column counts differ, the raw text is stored instead.

To ensure that your table is not chunked in the middle, make sure that every row has the same number of columns and that the table has at least two rows. This way, the parser will recognize it as a "perfect table" and convert it to a dataframe, keeping it intact.
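
To make the line-by-line idea concrete, here is a minimal standard-library sketch that groups consecutive lines starting with "|" into one block, so a table always travels as a single unit. It is only an illustration of the steps above, not the parser's actual implementation; the function name `split_markdown_blocks` and the sample text are made up for this example.

```python
from typing import List, Tuple


def split_markdown_blocks(text: str) -> List[Tuple[str, str]]:
    """Group consecutive lines into ("text", ...) and ("table", ...) blocks."""
    blocks: List[Tuple[str, List[str]]] = []
    for line in text.splitlines():
        # A line that starts with "|" is treated as part of a table.
        kind = "table" if line.startswith("|") else "text"
        if blocks and blocks[-1][0] == kind:
            blocks[-1][1].append(line)  # extend the current block
        else:
            blocks.append((kind, [line]))  # start a new block
    # Join each block and drop blocks that contain only whitespace.
    joined = [(kind, "\n".join(lines).strip()) for kind, lines in blocks]
    return [(kind, body) for kind, body in joined if body]


sample = """Intro paragraph.

| name | qty |
|------|-----|
| nuts | 3   |

Closing paragraph."""

blocks = split_markdown_blocks(sample)
for kind, body in blocks:
    print(kind, repr(body))
```

Because the whole table ends up in one block, any later chunking step can treat that block as indivisible.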

Here's the relevant code snippet from the MarkdownElementNodeParser class:
```python
# Excerpt from the table-handling code. `should_keep` and `perfect_table`
# are set earlier in the method, outside this excerpt.
table_lines = element.element.split("\n")
table_columns = [len(line.split("|")) for line in table_lines]
if len(set(table_columns)) > 1:
    # if the rows have different numbers of columns, it's not a perfect table;
    # the raw text is stored for such tables instead of converting them to a dataframe
    perfect_table = False
# verify that the (markdown) table has at least 2 rows
if len(table_lines) < 2:
    should_keep = False
# apply the table filters; for now they only filter out empty tables
if should_keep and perfect_table and table_filters is not None:
    should_keep = all(tf(element) for tf in table_filters)
# if the element is a table, convert it to a dataframe
if should_keep:
    if perfect_table:
        table = md_to_df(element.element)
        elements[idx] = Element(
            id=f"id_{node_id}_{idx}" if node_id else f"id_{idx}",
            type="table",
            element=element.element,
            table=table,
        )
    else:
        # for non-perfect tables, we will store the raw text
        # and give it a different type to differentiate it from perfect tables
        elements[idx] = Element(
            id=f"id_{node_id}_{idx}" if node_id else f"id_{idx}",
            type="table_text",
            element=element.element,
        )
```


You can find more details in the LlamaIndex documentation.
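
If you want to check your own tables before parsing, the column test from the snippet above can be tried in isolation. The sketch below reuses the same `split("|")` counting but is otherwise hypothetical: `classify_table` and the label "not_kept" are names invented for this example, the table filters are left out, and what the library does with a table that fails the two-row check is not shown in the excerpt.

```python
def classify_table(table_md: str) -> str:
    """Return "table", "table_text", or "not_kept" for a markdown table string."""
    table_lines = table_md.split("\n")
    # Same counting as the excerpt: split each line on "|" and compare lengths.
    column_counts = {len(line.split("|")) for line in table_lines}
    if len(table_lines) < 2:
        return "not_kept"  # fewer than two lines: should_keep would be False
    if len(column_counts) > 1:
        return "table_text"  # rows disagree on column count: raw text is stored
    return "table"  # consistent rows: eligible for dataframe conversion


good = "| a | b |\n|---|---|\n| 1 | 2 |"
ragged = "| a | b |\n|---|---|\n| 1 | 2 | 3 |"
single = "| a | b |"

print(classify_table(good))    # table
print(classify_table(ragged))  # table_text
print(classify_table(single))  # not_kept
# A trailing newline adds an empty last line, which counts as a different width.
print(classify_table(good + "\n"))  # table_text
```

The last case shows why stray blank lines matter in this sketch: with this counting, an empty line has a different width than the table rows.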
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
Give me an example of how to extract tables as well as text from a markdown file and then load it into a document @kapa.ai
Let me check the knowledge base for you. One moment! :books:
----------------------
To extract tables and text from a markdown file and load it into a document, you can use the MarkdownElementNodeParser class from LlamaIndex. Here's an example of how you can do it:

First, you need to initialize the MarkdownElementNodeParser:

```python
from llama_index.core.node_parser import MarkdownElementNodeParser

md_node_parser = MarkdownElementNodeParser()
```


Then, you can use the get_nodes_from_documents method to extract elements from the markdown documents:

```python
md_nodes = md_node_parser.get_nodes_from_documents(md_documents)
```


In the above code, md_documents is a list of Document objects that you want to parse. You can load markdown files into Document objects using the MarkdownReader class from LlamaIndex:

```python
from llama_index.readers.file import MarkdownReader
from pathlib import Path

reader = MarkdownReader()
md_documents = reader.load_data(Path("your_markdown_file.md"))
```


In the above code, replace "your_markdown_file.md" with the path to your markdown file.

The get_nodes_from_documents method returns a list of node objects. Internally, the parser first splits each document into elements, where each element is a piece of content such as a paragraph of text or a table, and then builds the nodes from those elements, so a table is handled as one unit.
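
A quick way to see what came back is to tally the returned nodes by class name. The helper below is a hypothetical standard-library sketch: `count_node_types` is not part of LlamaIndex, and the two stand-in classes exist only so the example runs without the library installed.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_node_types(nodes: Iterable[Any]) -> Dict[str, int]:
    """Count objects by class name to see how many nodes of each kind exist."""
    return dict(Counter(type(node).__name__ for node in nodes))


# Stand-in classes so the sketch runs without LlamaIndex installed.
class FakeTextNode:
    pass


class FakeTableNode:
    pass


demo_nodes = [FakeTextNode(), FakeTableNode(), FakeTextNode()]
node_counts = count_node_types(demo_nodes)
print(node_counts)
```

With the real parser you would call `count_node_types(md_nodes)` on the list returned by get_nodes_from_documents.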

You can find more details in the LlamaIndex documentation.
----------------------