It sounds like you're running into a memory issue when parsing a large document with LlamaIndex's Markdown parser, most likely because the document is too large for the memory available on your machine.
One way to address this is to split your document into smaller chunks before parsing. You can use the `SentenceSplitter` or `TokenTextSplitter` classes provided by LlamaIndex to break the document into smaller parts. Here's an example using `SentenceSplitter`:
```python
from llama_index.core.node_parser import SentenceSplitter

# `documents` is assumed to be a list of Document objects you have already loaded
node_parser = SentenceSplitter(chunk_size=256)
nodes = node_parser.get_nodes_from_documents(documents)
```
In this example, `chunk_size=256` specifies the maximum number of tokens in each chunk. You can adjust this value based on your memory capacity.
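If you'd rather split on raw token counts than on sentence boundaries, `TokenTextSplitter` is used the same way. This is a minimal sketch; the `chunk_overlap=20` value is just an illustrative choice, not something from the original answer:

```python
from llama_index.core.node_parser import TokenTextSplitter

# Split strictly by token count; chunk_overlap keeps a little shared context
# between neighbouring chunks (20 here is only an example value).
token_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
token_nodes = token_splitter.get_nodes_from_documents(documents)
```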
Another approach is to use the `IngestionPipeline` class to chain the parsing and splitting steps. Here's an example:
```python
from pathlib import Path

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SimpleFileNodeParser, SentenceSplitter
from llama_index.readers.file import FlatReader

# `reader` was not defined in the original snippet; FlatReader loads the raw file as-is
reader = FlatReader()

pipeline = IngestionPipeline(
    documents=reader.load_data(Path("./README.md")),
    transformations=[
        SimpleFileNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
    ],
)

md_chunked_nodes = pipeline.run()
print(md_chunked_nodes)
```
In this example, `SimpleFileNodeParser` parses the document into nodes, and `SentenceSplitter` then splits those nodes into smaller chunks.
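If memory is still tight after splitting, you can also feed documents through the pipeline one at a time instead of all at once, so only one document's nodes are held in memory while parsing. This is just a sketch of that idea, not part of the original example; the `docs` list is assumed to come from whatever reader you're using:

```python
# Process one document at a time to keep peak memory low.
# `docs` is assumed to be a list of already-loaded Document objects.
pipeline = IngestionPipeline(
    transformations=[
        SimpleFileNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
    ],
)

all_nodes = []
for doc in docs:
    all_nodes.extend(pipeline.run(documents=[doc]))
```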
Please note that these examples are based on the information provided in the extracts. You may need to adjust the code to fit your specific use case.