The community members are discussing the SentenceSplitter class and the tokenizer used under the hood. The main points are:
- The chunk size argument refers to the maximum number of tokens per chunk.
- When passing in a HuggingFace tokenizer, the text is not chunked as expected. Using the SentenceSplitter without a tokenizer works as expected.
- The tokenizer is used for counting tokens and needs to be a callable that takes a string and returns a list (see the sketch after this summary).
- If no tokenizer is provided, the default is the tiktoken GPT-2 tokenizer, but this will be changed to GPT-3.5 in a future update.
- Using the tokenizer associated with the embedding model will result in chunks closer to the max sequence length, but the difference is usually small enough that it doesn't matter too much.
There is no explicitly marked answer in the comments.
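To make the tokenizer requirement concrete, here is a minimal sketch of passing a HuggingFace tokenizer's encode method (a callable from string to list of token ids) into SentenceSplitter. It assumes a recent llama_index install where SentenceSplitter is importable from llama_index.core.node_parser (older releases expose it under a different path), and sentence-transformers/all-MiniLM-L6-v2 is just an example model.

```python
from transformers import AutoTokenizer
from llama_index.core.node_parser import SentenceSplitter  # import path differs on older versions

# Example embedding-model tokenizer; its `encode` method is a callable that
# takes a string and returns a list of token ids, which is what the splitter
# needs for counting tokens.
hf_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

splitter = SentenceSplitter(
    chunk_size=256,                  # max tokens per chunk, counted with the tokenizer below
    chunk_overlap=20,
    tokenizer=hf_tokenizer.encode,   # pass the encode function, not the tokenizer object
)

long_text = "Retrieval pipelines usually need documents broken into smaller pieces. " * 200
chunks = splitter.split_text(long_text)
print(len(chunks), "chunks")
```

Passing the tokenizer object itself (rather than its encode function) is what produced the unchunked output described below.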
I'm trying to better understand the SentenceSplitter class. When setting the chunk size, it appears that arg refers to the max number of tokens per chunk. What isn't so clear is what tokenizer is being used under the hood? I tried passing in a HF tokenizer to the tokenizer arg, but the output from doing so simply returned the text input without chunking it at all. Simply using the SentenceSplitter as is, without passing in any tokenizer, works as expected.
@Logan M that makes total sense now. I was passing in the entire tokenizer as an arg instead of the encode function itself. It works as expected now. So what tokenizer is being used under the hood if a user leaves the tokenizer arg as None?
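For reference, the default token counting described in the summary can be reproduced directly with tiktoken. This is only a sketch of the behavior mentioned in the thread (GPT-2 encoding now, the GPT-3.5 / cl100k_base encoding after the planned change), not the library's internal code.

```python
import tiktoken

# Per the thread, tokens are counted with tiktoken's GPT-2 encoding when no
# tokenizer is supplied; the planned change moves this to the GPT-3.5 encoding.
gpt2_enc = tiktoken.get_encoding("gpt2")
gpt35_enc = tiktoken.get_encoding("cl100k_base")

text = "An example sentence to count tokens for."
print(len(gpt2_enc.encode(text)))   # token count under the current default
print(len(gpt35_enc.encode(text)))  # token count under the future default
```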
Good to know. That isn't transparent in either the docstring or the codebase. I think it's safe to say, though, that our resultant chunks will be closer to the max_sequence_length of our embedding model if we pass in the tokenizer that model actually uses; otherwise we can expect the resultant chunks to always be slightly off, correct?
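One way to check this empirically is to measure each chunk with the embedding model's own tokenizer and compare against its max sequence length. The sketch below reuses the same hypothetical all-MiniLM-L6-v2 setup as the earlier example; model names and chunk sizes are illustrative only.

```python
from transformers import AutoTokenizer
from llama_index.core.node_parser import SentenceSplitter

hf_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20, tokenizer=hf_tokenizer.encode)

chunks = splitter.split_text("Some long document text about retrieval and chunking. " * 400)

# Count each chunk with the embedding model's own tokenizer and compare
# against the model's maximum sequence length.
chunk_token_counts = [len(hf_tokenizer.encode(chunk)) for chunk in chunks]
print("longest chunk (tokens):", max(chunk_token_counts))
print("model max sequence length:", hf_tokenizer.model_max_length)

# Counting the same chunks with a different tokenizer (e.g. the tiktoken default)
# would typically land a bit further from the model's limit, since tokenizers
# disagree slightly on token counts.
```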
Agreed. In previous work I was using the NLTK tokenizer because it had a reverse tokenize function built in; it gave us a rough approximation of the actual number of tokens used, and that was fine.
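For completeness, here is a rough sketch of the kind of NLTK round trip being described: word_tokenize for splitting and TreebankWordDetokenizer for reassembling, used only as an approximate token count rather than the embedding model's true subword count. The example text is made up for illustration.

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

nltk.download("punkt", quiet=True)  # word_tokenize needs the punkt models (newer NLTK may also want "punkt_tab")

text = "Chunking text with a rough, reversible tokenizer is often good enough."

tokens = word_tokenize(text)                              # rough token count
restored = TreebankWordDetokenizer().detokenize(tokens)   # the "reverse tokenize" step

print(len(tokens))   # approximate number of tokens
print(restored)      # text reconstructed from the tokens
```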