@kapa.ai Hi team, when working with the Markdown parser I got an error when it encountered the statement

nodes = node_parser.get_nodes_from_documents(documents)

My documents are the LlamaParse output for a 75-page document with some additional text in it. I'm getting an error when memory is low; is there any way around this?
Please stand by @Ansuman Satapathy as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
It seems like you're encountering a memory issue when trying to parse a large document using the Markdown parser in LlamaIndex. This could be due to the size of the document and the memory capacity of your machine.

One way to address this issue is to split your document into smaller chunks before parsing. You can use the SentenceSplitter or TokenTextSplitter classes provided by LlamaIndex to split your document into smaller parts. Here's an example of how to use SentenceSplitter:

Plain Text
from llama_index.core.node_parser import SentenceSplitter

# Split documents into chunks of at most 256 tokens each
node_parser = SentenceSplitter(chunk_size=256)
nodes = node_parser.get_nodes_from_documents(documents)

In this example, chunk_size=256 specifies the maximum number of tokens in each chunk. You can adjust this value based on your memory capacity.
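
If you would rather split on raw token counts, here is a minimal sketch using the TokenTextSplitter mentioned above (the chunk_overlap value is an illustrative choice):

Plain Text
from llama_index.core.node_parser import TokenTextSplitter

# Split strictly by token count rather than on sentence boundaries
node_parser = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
nodes = node_parser.get_nodes_from_documents(documents)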

Another approach is to use the IngestionPipeline class to chain the parsing and splitting operations. Here's an example:

Plain Text
from pathlib import Path

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SimpleFileNodeParser, SentenceSplitter
from llama_index.readers.file import FlatReader  # requires llama-index-readers-file

reader = FlatReader()

pipeline = IngestionPipeline(
    documents=reader.load_data(Path("./README.md")),
    transformations=[
        SimpleFileNodeParser(),  # picks a parser based on file type (Markdown here)
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
    ],
)

md_chunked_nodes = pipeline.run()
print(md_chunked_nodes)

In this example, SimpleFileNodeParser is used to parse the document, and SentenceSplitter is used to split the parsed document into smaller chunks.
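
If memory is still tight, a common workaround (a sketch, not something shown in the extracts; the batch size is an illustrative value) is to build the pipeline without attaching the documents and then feed them to run() in small batches, so only a slice of the document set is processed at once:

Plain Text
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SimpleFileNodeParser, SentenceSplitter

# Build the pipeline without documents attached
batch_pipeline = IngestionPipeline(
    transformations=[
        SimpleFileNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
    ],
)

# Feed documents to run() a few at a time to keep peak memory low
all_nodes = []
batch_size = 10  # illustrative; tune to your memory budget
for i in range(0, len(documents), batch_size):
    all_nodes.extend(batch_pipeline.run(documents=documents[i : i + batch_size]))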

Please note that these examples are based on the information provided in the extracts. You may need to adjust the code to fit your specific use case.
For more information, you can refer to the following sources:
  • Building a (Very Simple) Vector Store from Scratch
  • File Based Node Parsers
  • Node Parser Modules
  • Node Parser Package
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai can you rewrite this using a Markdown parser and some other transformations to prevent the memory issue? When the system reaches pipeline.run it is throwing an EOF error.
Let me check the knowledge base for you. One moment! :books:
----------------------
Sure, you can use the IngestionPipeline class in LlamaIndex to chain a Markdown parser and other transformations. This can help manage memory usage by processing the document in smaller chunks. Here's an example:

Plain Text
from pathlib import Path

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SimpleFileNodeParser, SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.file import FlatReader  # requires llama-index-readers-file

# Initialize the OpenAI embedding model with a small batch size
# to keep memory usage down
embed_model = OpenAIEmbedding(embed_batch_size=10)

# Define the transformations
transformations_example = [
    SimpleFileNodeParser(),
    SentenceSplitter(chunk_size=200, chunk_overlap=0),
    embed_model,
]

# Initialize the ingestion pipeline
reader = FlatReader()
pipeline = IngestionPipeline(
    documents=reader.load_data(Path("./README.md")),
    transformations=transformations_example,
)

# Run the pipeline
md_chunked_nodes = pipeline.run()
print(md_chunked_nodes)

In this example, SimpleFileNodeParser is used to parse the Markdown document, SentenceSplitter is used to split the parsed document into smaller chunks, and OpenAIEmbedding is used to embed the chunks. The chunk_size and chunk_overlap parameters of SentenceSplitter can be adjusted based on your memory capacity.
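
If your input is a directory of files rather than a single README, another option (again a sketch under that assumption; the "./data" path is illustrative) is SimpleDirectoryReader's iter_data(), which yields the documents for one file at a time so the whole corpus never sits in memory at once:

Plain Text
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline

# Reuse the transformations defined above, without attaching documents
file_pipeline = IngestionPipeline(transformations=transformations_example)

# iter_data() loads and yields one file's documents at a time
all_nodes = []
for file_docs in SimpleDirectoryReader("./data").iter_data():
    all_nodes.extend(file_pipeline.run(documents=file_docs))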

Please note that this example is based on the information provided in the extracts. You may need to adjust the code to fit your specific use case.
For more information, you can refer to the following sources:
  • File Based Node Parsers
  • Advanced Ingestion Pipeline
  • Node Parser Modules
  • Node Parser Package
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how can I check if the document is a valid Markdown schema?
Let me check the knowledge base for you. One moment! :books:
----------------------
To check whether a document is valid Markdown, you can run it through a Markdown parser. Keep in mind that most Markdown dialects treat nearly any text as parseable, so in practice "validation" usually means confirming that the parser produces the structure you expect rather than catching a hard error.

However, the extracts provided do not contain specific information or code examples on how to validate a Markdown schema using LlamaIndex or any other tool.

For general Markdown validation, you can use online tools like Dillinger or StackEdit. These tools allow you to paste your Markdown text and see if it renders correctly.

For more structural checks, such as verifying that the document follows a specific Markdown schema, you might need a Markdown parser library in a programming language like Python or JavaScript. Libraries like markdown-it-py for Python or markdown-it for JavaScript can parse the text and expose a token stream you can inspect to verify the structure.
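
A minimal sketch of this in markdown-it-py (my own illustration, not from the extracts). Note that CommonMark-style parsers accept almost any text, so this mainly confirms that parsing succeeds and lets you inspect the token structure:

Plain Text
# pip install markdown-it-py
from markdown_it import MarkdownIt

md = MarkdownIt()
with open("README.md") as f:
    tokens = md.parse(f.read())

# Inspect the parsed structure, e.g. list the headings that were found
for tok in tokens:
    if tok.type == "heading_open":
        print(tok.tag)  # "h1", "h2", ...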

Please note that these are general suggestions and may not be applicable to your specific use case. For more detailed guidance, you might need to refer to the documentation of the tool or library you are using.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------