Updated 2 weeks ago

Chunking Markdown Documentation for OpenSearch Insertion

Our product documentation is in Markdown. We'd like to split it into chunks based on the Markdown headers, then insert those chunks into OpenSearch.

Our documentation has a format like:

Header

description

So it's important that each header is chunked together with its description.

The main issue right now is chunking: MarkdownNodeParser is not parsing the docs correctly. Any suggestions? Should I chunk manually?
You can customize MarkdownNodeParser based on your requirements and then use it.
Sorry, how can I customize it?
You can do something like this:

from typing import List

from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.schema import BaseNode, TextNode

# Inherit from MarkdownNodeParser and override the method that creates nodes from your content.
class YourMarkdownNodeParser(MarkdownNodeParser):

    def get_nodes_from_node(self, node: BaseNode) -> List[TextNode]:
        # Change this method to fit your needs.
        # Until you do, this falls back to the default behavior.
        return super().get_nodes_from_node(node)

# You can find the current code for this method here:
# https://github.com/run-llama/llama_index/blob/69716aed521041ccc8dca49952c4fe168691d66d/llama-index-core/llama_index/core/node_parser/file/markdown.py#L37

# Once that is done, use your parser from there on.
# markdown_docs is your list of Document objects, loaded elsewhere.
parser = YourMarkdownNodeParser()
nodes = parser.get_nodes_from_documents(markdown_docs)
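
If you'd rather go the manual route you asked about, here is a minimal sketch in plain Python with no LlamaIndex dependency. It assumes your headers use the `#` (ATX) style and that a chunk is simply one header plus everything up to the next header. The function name `chunk_markdown_by_header` and the `header`/`text` keys are made up for illustration, not part of any library.

```python
import re
from typing import Dict, List

# Matches ATX-style headers such as "# Title" or "### Sub-section".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown_by_header(text: str) -> List[Dict[str, str]]:
    """Split markdown into chunks, each holding a header and the text under it."""
    chunks: List[Dict[str, str]] = []
    header = ""  # text that appears before the first header gets an empty header
    body: List[str] = []
    in_code_fence = False

    def flush() -> None:
        # Store the chunk collected so far, skipping completely empty ones.
        content = "\n".join(body).strip()
        if header or content:
            chunks.append({"header": header, "text": content})

    for line in text.splitlines():
        # Track fenced code blocks so a "# comment" inside code is not read as a header.
        if line.lstrip().startswith("```"):
            in_code_fence = not in_code_fence
        match = None if in_code_fence else HEADER_RE.match(line)
        if match:
            flush()
            header = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Install\n\nRun the installer.\n\n## Configure\n\nEdit the config file.\n"
chunks = chunk_markdown_by_header(sample)
```

Each resulting dict keeps the header with its description, so you can wrap it in a TextNode or send it to OpenSearch as one document. Every header level starts a new chunk here; if you only want to split on some levels, check the length of `match.group(1)` before flushing.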
Appreciate that.