How to customize the tokenizer used for embedding??
What's the issue here?

You can set the tokenizer in the text-splitter definition
The tokenizer used with the embedding model is gpt2, while the one I want to use is cl100k_base. Is there a way to set the embedding tokenizer to be cl100k_base instead of gpt2?
It's only used for counting tokens, not for actually sending tokenized data

In any case, you can set with something like this

Plain Text
import tiktoken
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index import ServiceContext, GPTSimpleVectorIndex

# use the cl100k_base encoding's encode function for token counting
tokenizer = tiktoken.get_encoding("cl100k_base").encode
text_splitter = TokenTextSplitter(tokenizer=tokenizer)
node_parser = SimpleNodeParser(text_splitter=text_splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)  # plus any other settings you need
index = GPTSimpleVectorIndex.from_documents(docs, service_context=service_context)
@Logan M doesn't this mean that the token count is wrong when chunking documents?
According to this OpenAI cookbook, it's not off by much, but I suppose it could be slightly off
The default embedder is text-embedding-ada-002, no? That uses cl100k_base and not gpt2... maybe llama_index should use that one by default?

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
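As an illustration of what that cookbook shows, you can compare the two encodings' token counts directly with tiktoken; the sample string below is just an example:

Plain Text
import tiktoken

text = "LlamaIndex splits documents into chunks based on token counts."

# count the same text with both encodings
gpt2_count = len(tiktoken.get_encoding("gpt2").encode(text))
cl100k_count = len(tiktoken.get_encoding("cl100k_base").encode(text))

# the counts are usually close, but not always identical
print(gpt2_count, cl100k_count)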
Looks like I'm NOT the first one to notice this 😉
https://github.com/jerryjliu/llama_index/issues/1205
Yea technically it's off by a tiny bit.

I don't know the full story, but there's a spot in the code where we check for Python 3.8 vs. Python 3.9

If you have Python 3.8, we use transformers for the tokenizer

But, transformers only has gpt2, so I thiiiiink gpt2 is the default to keep those two consistent.

I don't actually know if this check is needed though, and we can probably change the default lol, just never taken the chance to do it yet
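Roughly, the kind of version check being described looks something like this (just a sketch of the behavior, not the actual llama_index source):

Plain Text
import sys

def default_tokenizer():
    # sketch: pick a token-counting function based on the Python version
    if sys.version_info >= (3, 9):
        import tiktoken
        # tiktoken's gpt2 encoding
        return tiktoken.get_encoding("gpt2").encode
    # on Python 3.8, fall back to the transformers GPT-2 tokenizer
    from transformers import GPT2TokenizerFast
    return GPT2TokenizerFast.from_pretrained("gpt2").tokenize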
But! There is a new way to count tokens, and changing the tokenizer is very easy with the new way

https://gpt-index.readthedocs.io/en/latest/examples/callbacks/TokenCountingHandler.html
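For reference, the approach from that page looks roughly like this (a sketch based on the linked TokenCountingHandler docs; cl100k_base is used here to match the original question):

Plain Text
import tiktoken
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, TokenCountingHandler

# count tokens with the cl100k_base encoding (the one text-embedding-ada-002 uses)
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.get_encoding("cl100k_base").encode
)
callback_manager = CallbackManager([token_counter])
service_context = ServiceContext.from_defaults(callback_manager=callback_manager)

# after building an index or running queries with this service context:
print(token_counter.total_embedding_token_count)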