
Tokenizer

At a glance

The community members are discussing the SentenceSplitter class and the tokenizer used under the hood. The main points are:

- The chunk size argument refers to the maximum number of tokens per chunk.

- When passing in a HuggingFace tokenizer object directly, the text is not chunked at all; passing its encode method instead (or using the SentenceSplitter without a tokenizer) works as expected.

- The tokenizer is used only for counting tokens and needs to be a callable that takes a string and outputs a list (a minimal sketch of that contract follows this summary).

- If no tokenizer is provided, the default is the tiktoken GPT-2 tokenizer, but this will be changed to GPT-3.5 in a future update.

- Using the tokenizer associated with the embedding model will result in chunks closer to the max sequence length, but the difference is usually small enough that it doesn't matter too much.

There is no explicitly marked answer in the comments.
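
For reference, here is a minimal sketch of that tokenizer contract. It assumes a recent llama_index where SentenceSplitter is importable from llama_index.core.node_parser (older releases used a different path); the whitespace tokenizer is just an illustrative stand-in, not something from the library:

```python
from llama_index.core.node_parser import SentenceSplitter

def whitespace_tokenizer(text: str) -> list:
    # Any callable that takes a string and returns a list works here,
    # because SentenceSplitter only uses it to count tokens.
    return text.split()

splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20,
                            tokenizer=whitespace_tokenizer)
chunks = splitter.split_text("Long document text goes here. " * 200)
print(len(chunks), "chunks")
```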

I'm trying to better understand the SentenceSplitter class. When setting the chunk size, it appears that arg refers to the maximum number of tokens per chunk. What isn't so clear is which tokenizer is being used under the hood. I tried passing in a HF tokenizer to the tokenizer arg, but the output from doing so simply returned the text input without chunking it at all. Using the SentenceSplitter as-is, without passing in any tokenizer, works as expected.
8 comments
Yea it's the chunk size in tokens.

It will chunk the text close to the chunk size, while trying to respect sentence boundaries.

The tokenizer is used just for counting tokens. It needs to be a callable, though, that takes a string and outputs a list.

For HuggingFace, I would pass in tokenizer=hf_tokenizer.encode rather than the tokenizer object itself.
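
As a hedged sketch of that (the model name and import paths are assumptions, not from the thread):

```python
from transformers import AutoTokenizer
from llama_index.core.node_parser import SentenceSplitter

# Any HF tokenizer works; bert-base-uncased is only an example.
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pass the encode method (a callable str -> list of token ids), not the
# tokenizer object itself -- passing the object is what left the text unchunked.
splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=20,
    tokenizer=hf_tokenizer.encode,
)

long_text = "Your document text goes here. " * 200
chunks = splitter.split_text(long_text)
print(len(chunks), "chunks")
```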
@Logan M that makes total sense now. I was passing in the entire tokenizer as an arg instead of the encode function itself. It works as expected now. So what tokenizer is being used under the hood if a user leaves the tokenizer arg as None?
It defaults to tiktoken gpt2 under the hood

In the future, updates will make this LlamaIndex default easier to change 🙏
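
For reference, a sketch of counting tokens the same way the default does, assuming it matches tiktoken's gpt2 encoding as described above:

```python
import tiktoken

# Roughly what the default token counter does: encode with the gpt2
# encoding and count the resulting token ids.
enc = tiktoken.get_encoding("gpt2")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

print(count_tokens("How many tokens does this sentence use?"))
```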
Good to know. That isn't transparent in either the docstring or the code base. I think it's safe to say, though, that our resultant chunks will be closer to the max_sequence_length of our embedding model if we pass in the tokenizer associated with that model; otherwise we can expect the resultant chunks to always be slightly off, correct?
That's correct. Although tbh, the difference is usually small enough in my experience that it doesn't matter too terribly much
A future update will change the default tokenizer to gpt-3.5 (to match the default LLM) and add a single global setting to change it.
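
In newer llama_index releases this shows up as a global tokenizer setting; a sketch assuming the set_global_tokenizer helper (name and import path may differ in older versions):

```python
import tiktoken
from llama_index.core import set_global_tokenizer

# Make token counting line up with the gpt-3.5 family globally.
set_global_tokenizer(tiktoken.encoding_for_model("gpt-3.5-turbo").encode)
```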
Agreed. In previous work I was using the NLTK tokenizer because it had a reverse (detokenize) function built in; it gave us a rough approximation of the actual number of tokens used, and that was fine.
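
For illustration, a rough NLTK round-trip of that kind (the exact tokenizer used isn't specified in the thread; word_tokenize plus the Treebank detokenizer is one way NLTK supports reversing tokenization):

```python
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer

nltk.download("punkt", quiet=True)      # word/sentence tokenizer models
nltk.download("punkt_tab", quiet=True)  # needed on newer NLTK versions

text = "Approximate token counts are often good enough for chunking."
tokens = nltk.word_tokenize(text)                         # rough token proxy
restored = TreebankWordDetokenizer().detokenize(tokens)   # reverse step

print(len(tokens), "tokens ->", restored)
```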
BTW, thanks for your quick responses, this is a bad-ass community.