Hey all, I'm trying to use a custom open source LLM via LiteLLM

At a glance

The original poster is trying to use a custom open-source LLM (Mixtral-8x7B-Instruct-v0.1) via LiteLLM, but gets "ValueError: Initial token count exceeds token limit" when using the Chat_Engine. The same setup previously worked with GPT-3.5 Turbo 16k via OpenAI.

The comments suggest the settings are causing the chat engine to retrieve too much context: the posted initialization code sets the chunk size to 768 and the top k to 10, so roughly 7680 tokens are retrieved and inserted into the system prompt on every user message.

The thread also establishes that LiteLLM does not yet have context-window information for the Mixtral model, so llama-index falls back to a small default limit; a custom memory buffer with a higher token limit is suggested as the fix.

Hey all, I'm trying to use a custom open source LLM via LiteLLM. I can get it to work with basic sample code, but when I load it into the Chat_Engine everything loads fine, and then as soon as I do a simple, brief chat with no system prompt I always get "ValueError: Initial token count exceeds token limit". The only code I changed from my working original version was swapping OpenAI for: "llm = LiteLLM(model="together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1", temperature = 0)". Do I have to set the context length manually or something?
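For context, a minimal standalone call in llama-index 0.9.x looks roughly like this (a sketch using the model string from the question; it assumes the Together AI API key is already set in the environment):

Plain Text
from llama_index.llms import LiteLLM

# Sketch: the LLM on its own, outside the chat engine (Together AI key assumed to be in the environment)
llm = LiteLLM(model="together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1", temperature=0)
print(llm.complete("Say hello in one short sentence."))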
10 comments
it should be getting set automatically, judging by the code for the LLM
What kind of chat engine are you using?
sounds like the settings might be causing the chat engine to retrieve too much context -- did you set a large chunk size or top k?
what version of llama-index?
Version is 0.9.14.post3. I was previously using GPT-3.5 Turbo 16k via the OpenAI class, and I'm now switching to LiteLLM so that I can try out Mixtral (32K) via Together AI. Here's my initialization code:

Plain Text
llm = LiteLLM(model="together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1", temperature=0)
service_context = ServiceContext.from_defaults(
    llm=llm,
    chunk_size=768,
    callback_manager=callback_manager,
    chunk_overlap=75,
    embed_model=embed_model,
)
index = load_embeddings(service_context, storage_folder, game_system_name)
reranker = CohereRerank(api_key=api_key, top_n=50)
memory = ChatMemoryBuffer.from_defaults()

chat_engine = index.as_chat_engine(
    chat_mode="context", retriever_mode="embedding", similarity_top_k=10,
    node_postprocessors=[reranker],
    verbose=True,
    system_prompt="",  # " ".join((DM_Prompt, History_Prompt, Character_Prompt, Story_Prompt)),
)
It was running OK on OpenAI, so I'd be surprised if it suddenly ran over double the context window. How does llama-index know the context window of a random, super-new model going through LiteLLM?
No idea, I'm just looking at this function.

https://github.com/run-llama/llama_index/blob/3b892b3e82c140244d2b25a12ae35f91120a2264/llama_index/llms/litellm_utils.py#L111

Seems to be built into litellm

You've set the chunk size to 768 and the top k to 10. That means 7680 tokens are retrieved and inserted into the system prompt on every user message.
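A quick back-of-the-envelope sketch of that math (treating each retrieved chunk as roughly chunk_size tokens, which is an approximation):

Plain Text
# Approximate context inserted per user message: one chunk ~= chunk_size tokens
chunk_size = 768
similarity_top_k = 10
print(chunk_size * similarity_top_k)  # 7680 tokens, before the system prompt and chat history are counted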

By default, the memory is initialized based on the LLM's context window
https://github.com/run-llama/llama_index/blob/3b892b3e82c140244d2b25a12ae35f91120a2264/llama_index/chat_engine/context.py#L75

Just pass in a new memory that overrides this limit. Then it should be fine

Plain Text
from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=32000)
chat_engine = index.as_chat_engine(..., memory=memory)
(I think mixtral is 32k? Not sure)
@Logan M Ah ok, thanks, yeah that seems to be the issue:

Plain Text
try:
    context_size = int(litellm.get_max_tokens(modelname))
except Exception:
    context_size = 2048  # by default assume models have at least 2048 tokens

When I ran that directly I got:

Plain Text
Exception: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json


And Mixtral isn't in there yet. I'll just manually override it as you mentioned and give that a try, thanks!
LiteLLM seems to handle any model by just passing the model name through to the API, but things like the context window apparently need to be added manually.
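Putting the two snippets from the thread together, the final setup looks roughly like this (a sketch: index and reranker come from the earlier initialization code, and 32000 is an assumed token limit for Mixtral's 32k window):

Plain Text
from llama_index.memory import ChatMemoryBuffer

# Override the default memory, which would otherwise be sized from litellm's 2048-token fallback
memory = ChatMemoryBuffer.from_defaults(token_limit=32000)

chat_engine = index.as_chat_engine(
    chat_mode="context", retriever_mode="embedding", similarity_top_k=10,
    node_postprocessors=[reranker],
    verbose=True,
    system_prompt="",
    memory=memory,
)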