Chunking Markdown Documentation for OpenSearch Insertion

Question

We have our product documentation in markdown format. We'd like to split it into chunks based on markdown headers, then insert the chunks into OpenSearch.

Our documentation has a format like:

Header

description

So it's important that each Header is chunked together with its description.

The main issue right now is chunking: MarkdownNodeParser is not parsing correctly. Any suggestions? Should I chunk manually?

WhiteFang_Jr · Answer

You can customize MarkdownNodeParser based on your requirements and then use it.

versa · Answer

Sorry, how can I customize it?

WhiteFang_Jr · Answer

You can do something like this:

    from typing import List

    from llama_index.core.node_parser import MarkdownNodeParser
    from llama_index.core.schema import BaseNode, TextNode

    # Inherit from MarkdownNodeParser and override the method that creates nodes from your content.
    class YourMarkdownNodeParser(MarkdownNodeParser):
        def get_nodes_from_node(self, node: BaseNode) -> List[TextNode]:
            # Change this method to fit your needs.
            # You can find the current code for this method here:
            # https://github.com/run-llama/llama_index/blob/69716aed521041ccc8dca49952c4fe168691d66d/llama-index-core/llama_index/core/node_parser/file/markdown.py#L37
            return super().get_nodes_from_node(node)  # placeholder: default behavior until you customize it

    # Once that is done, use your parser from there on.
    parser = YourMarkdownNodeParser()
    nodes = parser.get_nodes_from_documents(markdown_docs)
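
On the "should I chunk manually?" part: if subclassing turns out to be more than you need, a manual split is also an option. Below is a minimal sketch in plain Python, not LlamaIndex's implementation. The function name `chunk_markdown_by_header` and the `header`/`text` dict keys are made up for illustration, and it assumes your docs use `#`-style headers. Each header stays with the description under it, and `#` lines inside fenced code blocks are not treated as headers.

```python
import re
from typing import Dict, List

# Matches ATX-style markdown headers such as "# Title" or "### Sub-section".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown_by_header(text: str) -> List[Dict[str, str]]:
    """Split markdown into chunks, each holding one header and the text under it.

    Hypothetical helper for illustration; not part of LlamaIndex.
    """
    chunks: List[Dict[str, str]] = []
    header = ""  # text before the first header is stored with an empty header
    body: List[str] = []
    in_code_fence = False

    def flush() -> None:
        # Store the section collected so far, skipping completely empty ones.
        content = "\n".join(body).strip()
        if header or content:
            chunks.append({"header": header, "text": content})

    for line in text.splitlines():
        # Track fenced code blocks so "# comment" lines in code are not split on.
        if line.lstrip().startswith("```"):
            in_code_fence = not in_code_fence
        match = None if in_code_fence else HEADER_RE.match(line)
        if match:
            flush()
            header = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

Each returned dict could then be turned into one document for indexing. How you map `header` and `text` onto your OpenSearch fields depends on your index, so that part is left out here.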

WhiteFang_Jr · Answer

https://github.com/run-llama/llama_index/blob/69716aed521041ccc8dca49952c4fe168691d66d/llama-index-core/llama_index/core/node_parser/file/markdown.py#L37

versa · Answer

appreciate that
