Has anyone tried writing a more involved node parser? It seems like you'd probably get much better results that way. For example, I've been thinking about a MarkdownNodeParser that understands headings and such, breaks on those before it splits into chunks, and perhaps keeps the section hierarchy you're currently nested within as metadata text; TBD there.
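Rough sketch of what I'm imagining (totally hypothetical helper, not an existing llama_index class): track the heading stack and emit sections with that stack as metadata, which a normal token splitter could then chunk further.

Python
# Hypothetical sketch, not real llama_index code: walk the markdown, track the
# heading stack, and emit (section_text, metadata) pairs.
import re
from typing import Dict, List, Tuple

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_headings(text: str) -> List[Tuple[str, Dict[str, str]]]:
    sections: List[Tuple[str, Dict[str, str]]] = []
    stack: List[Tuple[int, str]] = []  # (level, title) of headings we're nested under
    buffer: List[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            sections.append((body, {f"h{level}": title for level, title in stack}))
        buffer.clear()

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # pop headings at the same or deeper level before pushing the new one
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return sections

Each section would then go through the usual token splitter, with the heading stack carried along as metadata text.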
Great idea! Would love to see it!
Try it out and let us know how well it works. If there are specific issues with markdown that it's not doing well we can go in and try to replicate/fix it.
We support Langchain splitters but don't mind writing our own also.
Though I'm surprised by these implementations; it's a little odd to me that the splitters are not hierarchical in nature, or at least the NodeParser. It doesn't matter if it splits on headers, you still need to ensure the nodes it generates stay below some acceptable number of tokens.
So maybe the SimpleNodeParser should have an accompanying HierarchicalNodeParser?
Split using one splitter, then attempt to split the nodes again with the next one, and so on and so forth.
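Something like this, conceptually (made-up sketch, nothing here is real llama_index API; the word count is just a crude stand-in for a tokenizer):

Python
# Made-up sketch of the chaining idea: each splitter is a plain callable
# str -> List[str]; anything still over budget gets re-split by the next one.
from typing import Callable, List

Splitter = Callable[[str], List[str]]

def hierarchical_split(text: str, splitters: List[Splitter], max_tokens: int) -> List[str]:
    chunks = [text]
    for splitter in splitters:
        next_chunks: List[str] = []
        for chunk in chunks:
            if len(chunk.split()) > max_tokens:  # crude token estimate for the sketch
                next_chunks.extend(splitter(chunk))
            else:
                next_chunks.append(chunk)
        chunks = next_chunks
    return chunks

So you'd pass something like [split_on_h1, split_on_h2, split_on_paragraphs, token_splitter] (all hypothetical names) and only the oversized pieces keep getting broken down.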
That's a good point. Maybe we can do a better one using this structure you're talking about.
In the meantime, I'll hack something together to see where I end up.
Keep us updated!
Well, the first attempt failed; the interplay here is very confusing. The llama_index code looks like it uses the LangChain text splitters, but they use two different APIs: one returns Documents, the other TextSplits.
Python
def get_text_splits_from_document(
    document: BaseNode,
    text_splitter: TextSplitter,
    include_metadata: bool = True,
) -> List[TextSplit]:
    """Break the document into chunks with additional info."""
    # TODO: clean up since this only exists due to the diff w LangChain's TextSplitter
    if isinstance(text_splitter, TokenTextSplitter):
        # use this to extract extra information about the chunks
        text_splits = text_splitter.split_text_with_overlaps(
            document.get_content(metadata_mode=MetadataMode.NONE),
            metadata_str=document.get_metadata_str() if include_metadata else None,
        )
    else:
        text_chunks = text_splitter.split_text(
            document.get_content(),
        )
        text_splits = [TextSplit(text_chunk=text_chunk) for text_chunk in text_chunks]

    return text_splits


This logic is no good. If you implement your splitter so that it returns a List[str], it will work, but if you implement your TextSplitter to return List[TextSplit] so that each split can carry metadata, like what the header stack was for that chunk of text, you can't.

I'd probably modify

Python
text_chunks = text_splitter.split_text(
    document.get_content(),
)
text_splits = [TextSplit(text_chunk=text_chunk) for text_chunk in text_chunks]

so that it doesn't assume the result is a string, but actually handles strings, Documents, or TextSplits being returned.
Even though the abstract base method in LangChain returns List[str], it's not a hard and fast rule; some of their splitters return List[Document] instead.
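Something like this, roughly, as a drop-in for the else branch of the snippet above (untested, and it assumes LangChain's Document with its .page_content / .metadata attributes):

Python
# Rough, untested sketch of the change: handle whatever the splitter actually
# returns instead of assuming List[str].
from langchain.schema import Document

raw_splits = text_splitter.split_text(document.get_content())
text_splits = []
for split in raw_splits:
    if isinstance(split, TextSplit):
        # splitter already produced TextSplits, e.g. carrying a header-stack metadata string
        text_splits.append(split)
    elif isinstance(split, Document):
        # some LangChain splitters return Documents with metadata attached
        # (you'd probably want to fold split.metadata into the node metadata too)
        text_splits.append(TextSplit(text_chunk=split.page_content))
    else:
        # plain string, the current assumption
        text_splits.append(TextSplit(text_chunk=split))
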
@Nick Darnell a little lost lol, is the issue that LangChain splitters don't return strings? Could probably add an isinstance check for LangChain's base splitter class to handle this nicely?
I guess they must have returned strings at some point in the past, and this just wasn't maintained well and snuck by 😅
They sometimes return strings
They sometimes return documents
Basically whenever the splitter needs to include metadata
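For example (from memory, so double-check against whatever LangChain version you have installed):

Python
from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

markdown_text = "# Title\n\nSome intro.\n\n## Section\n\nSection body."

# returns List[str]
chunks = RecursiveCharacterTextSplitter(chunk_size=64, chunk_overlap=8).split_text(markdown_text)

# returns List[Document], with the matched headers stored in each Document's metadata
docs = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]
).split_text(markdown_text)
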
Well that's inconsistent haha
good to know!
@Logan M I kinda wish someone would make another library just for writing splitters with the expectation of chaining them. As it stands now, I can feed one into another, but even though the new things are nodes, it expects documents.

A library that assumed you get N TextSplits (the first parser getting 1), with every parser only consuming and outputting TextSplits, would be ideal.
Also makes more sense if you have mixed media documents
So maybe instead of TextSplits you have MediaSplits; if a parser only understands TextSplits, it ignores anything else.
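Super rough sketch of the kind of interface I mean (all made-up names, not a real library):

Python
# Made-up sketch: every stage consumes and produces TextSplits, and the first
# stage gets exactly one split containing the whole document.
from dataclasses import dataclass, field
from typing import Dict, List, Protocol

@dataclass
class TextSplit:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)

class SplitStage(Protocol):
    def split(self, splits: List[TextSplit]) -> List[TextSplit]: ...

def run_pipeline(document_text: str, stages: List[SplitStage]) -> List[TextSplit]:
    splits = [TextSplit(text=document_text)]
    for stage in stages:
        splits = stage.split(splits)
    return splits

A MediaSplit variant could flow through the same pipeline, and stages that only understand TextSplits would just pass anything else through untouched.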
Definitely open to some contributions around this. We could definitely be putting more thought into how the splitters work.