Tokenizer

I tried to use a custom LLM API and a custom embedding API (not via OpenAI or Hugging Face). I implemented a custom LLM class and a custom embedding class. However, when I tried to use them, I got a ConnectionError, which is caused by tiktoken in ChatMemoryBuffer.
You can pass any function as a tokenizer. The only requirement is that it takes a string and returns a list.

It's what's used to count tokens.
For example, a very naive/dumb tokenizer (don't use this, just illustrating an example):

Plain Text
def tokenizer(text):
    return text.split(" ")
Thanks for your help. But I don't know how to set the custom tokenizer as the default tokenizer.
You can set the tokenizer in the memory buffer

Plain Text
def tokenizer(text):
    return text.split(" ")

memory = ChatMemoryBuffer.from_defaults(..., tokenizer_fn=tokenizer)
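To see why "takes a string and returns a list" is the only requirement, here's a simplified sketch of how a memory buffer can use `tokenizer_fn` purely for counting. This is NOT the real llama_index `ChatMemoryBuffer` (the class name `TinyMemoryBuffer` is hypothetical), just an illustration of the mechanism:

```python
# Hypothetical sketch: a buffer that keeps only the most recent messages
# fitting within a token budget, using tokenizer_fn only via len().
class TinyMemoryBuffer:
    def __init__(self, token_limit, tokenizer_fn):
        self.token_limit = token_limit
        self.tokenizer_fn = tokenizer_fn
        self.messages = []

    def put(self, message):
        self.messages.append(message)

    def get(self):
        # Walk messages newest-first, keeping those that fit the budget.
        kept, used = [], 0
        for msg in reversed(self.messages):
            n = len(self.tokenizer_fn(msg))  # only the count matters
            if used + n > self.token_limit:
                break
            kept.append(msg)
            used += n
        return list(reversed(kept))
```

Since only `len()` of the returned list is used, a whitespace split is a valid (if crude) drop-in for tiktoken here.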
I found an issue! I need to import ChatMemoryBuffer before setting the tokenizer. However, my development environment cannot connect to OpenAI, so I receive an error immediately after importing llama_index.
My current situation is that I have set up an LLM service and an embedding service in the cloud, and I have also deployed Milvus in the cloud. Now I want to use my own resources with llama_index. Do you have any suggestions?
What is the error you received?
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))
File "XXX\venv\Lib\site-packages\tiktoken\load.py", line 24, in read_file
resp = requests.get(blobpath)
File "XXX\venv\Lib\site-packages\tiktoken_ext\openai_public.py", line 11, in gpt2
mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
The main error occurs here.
It tries to download the BPE file and the JSON file from OpenAI.
right -- there's a bunch of places in the code that use tiktoken, it might not be entirely easy to avoid this 😅
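One possible offline workaround (an assumption, not confirmed in this thread): tiktoken checks a local cache directory before downloading its BPE files, and the `TIKTOKEN_CACHE_DIR` environment variable controls where that cache lives. If you copy the cached files from a machine that *can* reach OpenAI's blob storage into that directory, tiktoken should skip the download. The cache path below is a placeholder:

```python
import os

# Point tiktoken at a pre-populated cache directory so it never tries to
# download BPE/JSON files. This must be set BEFORE tiktoken is imported.
# "/path/to/bpe_cache" is a placeholder -- copy the cache contents there
# from a machine with internet access.
os.environ["TIKTOKEN_CACHE_DIR"] = "/path/to/bpe_cache"

# import tiktoken  # safe to import only after the env var is set
```

Note the cached filenames are derived from the source URLs (hashed), so copying an existing cache directory wholesale is more reliable than downloading the files by hand.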