Has anyone tried writing a more involved node parser? It seems like you'd probably get much better results that way. For example, I've been thinking about a MarkdownNodeParser that understands headings and such, breaks on those before it splits into chunks, and perhaps keeps the section hierarchy you're currently nested within as metadata text; TBD there.
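Rough sketch of what I'm imagining (totally hypothetical helper, not an existing llama_index class): track the heading stack and emit sections with that stack as metadata, which a normal token splitter could then chunk further.

Python
# Hypothetical sketch, not real llama_index code: walk the markdown, track the
# heading stack, and emit (section_text, metadata) pairs.
import re
from typing import Dict, List, Tuple

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_headings(text: str) -> List[Tuple[str, Dict[str, str]]]:
    sections: List[Tuple[str, Dict[str, str]]] = []
    stack: List[Tuple[int, str]] = []  # (level, title) of headings we're nested under
    buffer: List[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            sections.append((body, {f"h{level}": title for level, title in stack}))
        buffer.clear()

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # pop headings at the same or deeper level before pushing the new one
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return sections

Each section would then go through the usual token splitter, with the heading stack carried along as metadata text.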
Great idea! Would love to see it!
Try it out and let us know how well it works. If there are specific issues with markdown that it's not doing well we can go in and try to replicate/fix it.
We support Langchain splitters but don't mind writing our own also.
Though I'm surprised by these implementations; it's a little odd to me that the splitters are not hierarchical in nature, or at least the NodeParser. It doesn't matter if it splits on headers, you still need to ensure the nodes it generates stay below some acceptable number of tokens.
So maybe the SimpleNodeParser should have an accompanying HierarchicalNodeParser?
Split using one splitter, then attempt to split the nodes again with the next one, and so on and so forth.
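Something like this, conceptually (made-up sketch, nothing here is real llama_index API; the word count is just a crude stand-in for a tokenizer):

Python
# Made-up sketch of the chaining idea: each splitter is a plain callable
# str -> List[str]; anything still over budget gets re-split by the next one.
from typing import Callable, List

Splitter = Callable[[str], List[str]]

def hierarchical_split(text: str, splitters: List[Splitter], max_tokens: int) -> List[str]:
    chunks = [text]
    for splitter in splitters:
        next_chunks: List[str] = []
        for chunk in chunks:
            if len(chunk.split()) > max_tokens:  # crude token estimate for the sketch
                next_chunks.extend(splitter(chunk))
            else:
                next_chunks.append(chunk)
        chunks = next_chunks
    return chunks

So you'd pass something like [split_on_h1, split_on_h2, split_on_paragraphs, token_splitter] (all hypothetical names) and only the oversized pieces keep getting broken down.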
That's a good point. Maybe we can do a better one using this structure you're talking about.
In the meantime, I'll hack something together to see where I end up.
Keep us updated!
Well, the first attempt failed; the interplay here is very confusing. The llama_index code looks like it uses the LangChain text splitters, but they use two different APIs: one returns Documents, the other TextSplits.
Python
def get_text_splits_from_document(
    document: BaseNode,
    text_splitter: TextSplitter,
    include_metadata: bool = True,
) -> List[TextSplit]:
    """Break the document into chunks with additional info."""
    # TODO: clean up since this only exists due to the diff w LangChain's TextSplitter
    if isinstance(text_splitter, TokenTextSplitter):
        # use this to extract extra information about the chunks
        text_splits = text_splitter.split_text_with_overlaps(
            document.get_content(metadata_mode=MetadataMode.NONE),
            metadata_str=document.get_metadata_str() if include_metadata else None,
        )
    else:
        text_chunks = text_splitter.split_text(
            document.get_content(),
        )
        text_splits = [TextSplit(text_chunk=text_chunk) for text_chunk in text_chunks]

    return text_splits


This logic is no good. If you implement your splitter so that it returns a List[str], it will work, but if you implement your TextSplitter to return List[TextSplit] so that each split can carry metadata, like what the header stack was for that chunk of text, you can't.

I'd probably modify

Python
text_chunks = text_splitter.split_text(
    document.get_content(),
)
text_splits = [TextSplit(text_chunk=text_chunk) for text_chunk in text_chunks]

so that it doesn't assume the result is a string, but actually handles strings, Documents, or TextSplits being returned.
Even though the abstract base method in LangChain returns List[str], it's not a hard and fast rule; some of their splitters return List[Document] instead.
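Something like this, roughly, as a drop-in for the else branch of the snippet above (untested, and it assumes LangChain's Document with its .page_content / .metadata attributes):

Python
# Rough, untested sketch of the change: handle whatever the splitter actually
# returns instead of assuming List[str].
from langchain.schema import Document

raw_splits = text_splitter.split_text(document.get_content())
text_splits = []
for split in raw_splits:
    if isinstance(split, TextSplit):
        # splitter already produced TextSplits, e.g. carrying a header-stack metadata string
        text_splits.append(split)
    elif isinstance(split, Document):
        # some LangChain splitters return Documents with metadata attached
        # (you'd probably want to fold split.metadata into the node metadata too)
        text_splits.append(TextSplit(text_chunk=split.page_content))
    else:
        # plain string, the current assumption
        text_splits.append(TextSplit(text_chunk=split))
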
@Nick Darnell a little lost lol, is the issue that LangChain splitters don't return strings? Could probably add an isinstance check for LangChain's base splitter class to handle this nicely?
I guess they must have returned strings at some point in the past, and this just wasn't maintained well and snuck by 😅
They sometimes return strings
They sometimes return documents
Basically whenever the splitter needs to include metadata
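For example (from memory, so double-check against whatever LangChain version you have installed):

Python
from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

markdown_text = "# Title\n\nSome intro.\n\n## Section\n\nSection body."

# returns List[str]
chunks = RecursiveCharacterTextSplitter(chunk_size=64, chunk_overlap=8).split_text(markdown_text)

# returns List[Document], with the matched headers stored in each Document's metadata
docs = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]
).split_text(markdown_text)
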
Well that's inconsistent haha
good to know!
@Logan M I kinda wish someone would make another library just for writing splitters with the expectation of chaining them. As it stands now, I can feed one into another, but even though the new things are nodes, it expects documents.

A library that assumed you get N TextSplits (the first parser getting 1), with every parser only consuming and outputting TextSplits, would be ideal.
Also makes more sense if you have mixed media documents
So maybe instead of TextSplits you have MediaSplits; if a parser only understands TextSplits, it ignores anything else.
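Super rough sketch of the kind of interface I mean (all made-up names, not a real library):

Python
# Made-up sketch: every stage consumes and produces TextSplits, and the first
# stage gets exactly one split containing the whole document.
from dataclasses import dataclass, field
from typing import Dict, List, Protocol

@dataclass
class TextSplit:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)

class SplitStage(Protocol):
    def split(self, splits: List[TextSplit]) -> List[TextSplit]: ...

def run_pipeline(document_text: str, stages: List[SplitStage]) -> List[TextSplit]:
    splits = [TextSplit(text=document_text)]
    for stage in stages:
        splits = stage.split(splits)
    return splits

A MediaSplit variant could flow through the same pipeline, and stages that only understand TextSplits would just pass anything else through untouched.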
Definitely open to some contributions around this. We could definitely be putting more thought into how the splitters work.