How to customize the tokenizer used for embedding??
What's the issue here?

You can set the tokenizer in the text-splitter definition
The tokenizer used with the embedding model is gpt2, while the one I want to use is cl100k_base. Is there a way to set the embedding tokenizer to be cl100k_base instead of gpt2?
It's only used for counting tokens, not for actually sending tokenized data

In any case, you can set with something like this

Plain Text
import tiktoken
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index import ServiceContext, GPTSimpleVectorIndex

# use the cl100k_base encoding's encode function for token counting
tokenizer = tiktoken.get_encoding("cl100k_base").encode
text_splitter = TokenTextSplitter(tokenizer=tokenizer)
node_parser = SimpleNodeParser(text_splitter=text_splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)  # plus any other settings you need
index = GPTSimpleVectorIndex.from_documents(docs, service_context=service_context)
@Logan M doesn't this mean that the token count is wrong when chunking documents?
According to this OpenAI cookbook, it's not off by much, but I suppose it could be slightly off
The default embedder is text-embedding-ada-002, no? That uses cl100k_base and not gpt2... maybe llama_index should use that one by default?

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
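As an illustration of what that cookbook shows, you can compare the two encodings' token counts directly with tiktoken; the sample string below is just an example:

Plain Text
import tiktoken

text = "LlamaIndex splits documents into chunks based on token counts."

# count the same text with both encodings
gpt2_count = len(tiktoken.get_encoding("gpt2").encode(text))
cl100k_count = len(tiktoken.get_encoding("cl100k_base").encode(text))

# the counts are usually close, but not always identical
print(gpt2_count, cl100k_count)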
Looks like I'm NOT the first one to notice this 😉
https://github.com/jerryjliu/llama_index/issues/1205
Yea technically it's off by a tiny bit.

I don't know the full story, but there's a spot in the code where we check for Python 3.8 vs. Python 3.9

If you have Python 3.8, we use transformers for the tokenizer

But, transformers only has gpt2, so I thiiiiink gpt2 is the default to keep those two consistent.

I don't actually know if this check is needed though, and we can probably change the default lol, just never taken the chance to do it yet
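Roughly, the kind of version check being described looks something like this (just a sketch of the behavior, not the actual llama_index source):

Plain Text
import sys

def default_tokenizer():
    # sketch: pick a token-counting function based on the Python version
    if sys.version_info >= (3, 9):
        import tiktoken
        # tiktoken's gpt2 encoding
        return tiktoken.get_encoding("gpt2").encode
    # on Python 3.8, fall back to the transformers GPT-2 tokenizer
    from transformers import GPT2TokenizerFast
    return GPT2TokenizerFast.from_pretrained("gpt2").tokenize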
But! There is a new way to count tokens, and changing the tokenizer is very easy with the new way

https://gpt-index.readthedocs.io/en/latest/examples/callbacks/TokenCountingHandler.html
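For reference, the approach from that page looks roughly like this (a sketch based on the linked TokenCountingHandler docs; cl100k_base is used here to match the original question):

Plain Text
import tiktoken
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, TokenCountingHandler

# count tokens with the cl100k_base encoding (the one text-embedding-ada-002 uses)
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.get_encoding("cl100k_base").encode
)
callback_manager = CallbackManager([token_counter])
service_context = ServiceContext.from_defaults(callback_manager=callback_manager)

# after building an index or running queries with this service context:
print(token_counter.total_embedding_token_count)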