The community member tried to use a custom LLM API and embedding API but encountered a ConnectionError caused by tiktoken in ChatMemoryBuffer. Other community members suggested writing a custom tokenizer function and setting it on the memory buffer. However, the community member then hit a bootstrapping problem: setting the tokenizer requires importing ChatMemoryBuffer first, and their development environment cannot connect to OpenAI, so the import itself fails. The community member has an LLM service, an embedding service, and Milvus deployed in the cloud, and is looking for suggestions on how to use llama_index with their own resources.
I tried to use a custom LLM API and embedding API (not via OpenAI or Hugging Face). I implemented a custom LLM class and a custom embedding class. However, when I tried to use them, I got a ConnectionError caused by tiktoken in ChatMemoryBuffer.
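For reference, a minimal sketch of the suggested workaround: ChatMemoryBuffer.from_defaults accepts a tokenizer_fn callable, so any locally available tokenizer can replace the tiktoken default. The tokenizer path and token limit below are placeholders, and depending on your llama_index version the import may be llama_index.memory rather than llama_index.core.memory.

```python
from llama_index.core.memory import ChatMemoryBuffer
from transformers import AutoTokenizer

# Load a tokenizer that is available locally (path is a placeholder);
# any callable that maps a string to a list of tokens will work.
local_tokenizer = AutoTokenizer.from_pretrained("/path/to/local/tokenizer")

# Pass it as tokenizer_fn so the buffer never falls back to tiktoken,
# which would try to download its encoding files from the internet.
memory = ChatMemoryBuffer.from_defaults(
    token_limit=3000,
    tokenizer_fn=local_tokenizer.encode,
)
```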
I found the issue! I need to import ChatMemoryBuffer before I can set the tokenizer on it. But my development environment cannot connect to OpenAI, so I receive an error immediately after importing llama_index.
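Assuming the import-time failure comes from tiktoken trying to download its encoding files, one sketch of a workaround is to point tiktoken at a pre-populated local cache via its TIKTOKEN_CACHE_DIR environment variable before the import, and then replace the global tokenizer entirely. The cache and tokenizer paths are placeholders.

```python
import os

# Point tiktoken at a local cache *before* importing llama_index,
# so nothing tries to reach the internet at import time.
# (Copy the tiktoken encoding files into this directory beforehand.)
os.environ["TIKTOKEN_CACHE_DIR"] = "/path/to/tiktoken_cache"

from llama_index import set_global_tokenizer  # llama_index.core on newer versions
from transformers import AutoTokenizer

# Replace the global tiktoken default with a locally available tokenizer.
set_global_tokenizer(
    AutoTokenizer.from_pretrained("/path/to/local/tokenizer").encode
)
```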
My current situation is that I have set up an LLM service and an embedding service in the cloud, and I have also deployed Milvus in the cloud. Now I want to use llama_index with my own resources. Do you have any suggestions?
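One way to wire this together, sketched below assuming llama_index 0.10+ with the Milvus integration (llama-index-vector-stores-milvus) installed: route all LLM and embedding calls through the custom classes via Settings, and use MilvusVectorStore for storage. MyCloudLLM and MyCloudEmbedding stand in for the custom classes already written, and the endpoints, collection name, and embedding dimension are placeholders.

```python
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.vector_stores.milvus import MilvusVectorStore

# Route all LLM and embedding calls to the custom cloud services
# instead of OpenAI (class names are placeholders for your own).
Settings.llm = MyCloudLLM(api_base="https://your-llm-endpoint")
Settings.embed_model = MyCloudEmbedding(api_base="https://your-embed-endpoint")

# Use the Milvus deployment in the cloud as the vector store.
vector_store = MilvusVectorStore(
    uri="http://your-milvus-host:19530",
    collection_name="my_collection",
    dim=1024,  # must match the custom embedding dimension
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the index and query it without ever touching OpenAI.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
print(index.as_query_engine().query("Hello from my own stack"))
```

On versions before 0.10, the same wiring goes through ServiceContext instead of Settings, but the idea is identical: as long as the custom tokenizer is set as shown earlier, nothing in this path should require a connection to OpenAI.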