hey is there any additional information

At a glance

The community members discuss the MetadataAwareTextSplitter class from the LlamaIndex library. It is a base class meant to be extended; SentenceSplitter and TokenTextSplitter are both subclasses. Its purpose is to split text while accounting for the associated metadata, since that metadata is included whenever text is sent to the language model: the "would-be" length of the metadata is reserved when the initial text is split, so that the length of a chunk plus the length of its metadata never exceeds the chunk size. The SentenceSplitter is also the splitter used in the IngestionPipeline.
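
As a rough illustration of why the metadata length matters, this sketch prints what a node's content looks like once it is prepared for the LLM (imports assume a llama-index >= 0.10 layout; older releases expose the same classes under llama_index.schema):

```python
# Illustrative only: shows that a node's metadata is prepended to its text
# when the content is rendered for the LLM.
from llama_index.core.schema import MetadataMode, TextNode

node = TextNode(
    text="LlamaIndex is a data framework for LLM applications.",
    metadata={"file_name": "intro.md", "category": "docs"},
)

# The LLM receives the metadata string plus the text, which is why the splitter
# has to leave room for the metadata when sizing chunks.
print(node.get_content(metadata_mode=MetadataMode.LLM))
```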

Useful resources
hey is there any additional information on https://docs.llamaindex.ai/en/stable/api/llama_index.node_parser.MetadataAwareTextSplitter.html? What is it intended for? How does it work?
7 comments
It's a base class that is meant to be extended. The SentenceSplitter and TokenTextSplitter are both subclasses of this
Since metadata is included when sending text to the LLM, the text needs to be split with that metadata considered
That class makes it a little easier when implementing new text splitters
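
For a sense of what extending the class looks like, here is a hypothetical sketch (the NewlineSplitter name and its naive logic are invented for illustration; the import path follows the linked pre-0.10 docs, while newer releases expose the class from llama_index.core.node_parser):

```python
from typing import List

from llama_index.node_parser import MetadataAwareTextSplitter


class NewlineSplitter(MetadataAwareTextSplitter):
    """Hypothetical toy splitter: one chunk per non-empty line."""

    def split_text(self, text: str) -> List[str]:
        return [line for line in text.splitlines() if line.strip()]

    def split_text_metadata_aware(self, text: str, metadata_str: str) -> List[str]:
        # A real implementation would shrink the per-chunk token budget by the
        # token length of metadata_str; this toy version ignores it.
        return self.split_text(text)
```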
I am wondering in particular about SentenceSplitter... is it used when sending text to the LLM?
I only know it from using it in IngestionPipeline
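
For reference, a minimal sketch of SentenceSplitter inside an IngestionPipeline (llama-index >= 0.10 imports assumed; the pipeline only chunks documents into nodes here, it does not call the LLM):

```python
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=256, chunk_overlap=20)],
)

docs = [
    Document(
        text=(
            "LlamaIndex splits documents into nodes before indexing. "
            "Each node keeps the source document's metadata."
        ),
        metadata={"file_name": "notes.txt"},
    )
]

# Each resulting node carries the document's metadata alongside its chunk of text.
nodes = pipeline.run(documents=docs)
print(len(nodes), nodes[0].metadata)
```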
ah ok, now I get it. When the initial text is being split, the "would-be" length of the metadata is included. So when sending to the LLM in the response synthesizer, len(chunk) + len(metadata) <= chunk_size
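
A small sketch of that behaviour (llama-index >= 0.10 imports assumed; split_text_metadata_aware is the method MetadataAwareTextSplitter subclasses implement):

```python
from llama_index.core.node_parser import SentenceSplitter

text = " ".join(
    f"Sentence number {i} talks about metadata-aware splitting." for i in range(50)
)
metadata_str = "file_name: guide.md\ncategory: docs"

splitter = SentenceSplitter(chunk_size=64, chunk_overlap=0)

# The token length of metadata_str is subtracted from the chunk budget, so each
# chunk satisfies token_len(chunk) + token_len(metadata) <= chunk_size once the
# metadata is prepended on the way to the LLM.
chunks = splitter.split_text_metadata_aware(text, metadata_str)
print(f"{len(chunks)} chunks produced with the metadata budget reserved")
```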