Hey all, I'm trying to use a custom open source LLM via LiteLLM

At a glance

The original poster is trying to use a custom open-source LLM (Mixtral-8x7B-Instruct-v0.1) via LiteLLM, but gets "ValueError: Initial token count exceeds token limit" when using the Chat_Engine. The same setup previously worked with GPT-3.5 Turbo 16k via OpenAI.

The comments suggest the settings are causing the chat engine to retrieve too much context: the posted initialization code sets the chunk size to 768 and the top k to 10, so roughly 7680 tokens are retrieved and inserted into the system prompt on every user message.

The thread also establishes that LiteLLM does not yet have context-window information for the Mixtral model, so llama-index falls back to a small default limit; a custom memory buffer with a higher token limit is suggested as the fix.

Hey all, I'm trying to use a custom open source LLM via LiteLLM. I can get it to work with basic sample code, but when I load it into the Chat_Engine everything loads fine, and then as soon as I do a simple, brief chat with no system prompt I always get "ValueError: Initial token count exceeds token limit". The only code I changed from my working original version was swapping OpenAI for: "llm = LiteLLM(model="together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1", temperature = 0)". Do I have to set the context length manually or something?
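For context, a minimal standalone call in llama-index 0.9.x looks roughly like this (a sketch using the model string from the question; it assumes the Together AI API key is already set in the environment):

Plain Text
from llama_index.llms import LiteLLM

# Sketch: the LLM on its own, outside the chat engine (Together AI key assumed to be in the environment)
llm = LiteLLM(model="together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1", temperature=0)
print(llm.complete("Say hello in one short sentence."))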
10 comments
it should be getting set automatically, judging by the code for the LLM
What kind of chat engine are you using?
sounds like the settings might be causing the chat engine to retrieve too much context -- did you set a large chunk size or top k?
what version of llama-index?
Version is 0.9.14.post3. I was previously using GPT-3.5 Turbo 16k via the OpenAI class, and I'm now switching to LiteLLM so that I can try out Mixtral (32K) via Together AI. Here's my initialization code:

Plain Text
llm = LiteLLM(model="together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1", temperature=0)
service_context = ServiceContext.from_defaults(
    llm=llm,
    chunk_size=768,
    callback_manager=callback_manager,
    chunk_overlap=75,
    embed_model=embed_model,
)
index = load_embeddings(service_context, storage_folder, game_system_name)
reranker = CohereRerank(api_key=api_key, top_n=50)
memory = ChatMemoryBuffer.from_defaults()

chat_engine = index.as_chat_engine(
    chat_mode="context", retriever_mode="embedding", similarity_top_k=10,
    node_postprocessors=[reranker],
    verbose=True,
    system_prompt="",  # " ".join((DM_Prompt, History_Prompt, Character_Prompt, Story_Prompt)),
)
It was running OK on OpenAI, so I'd be surprised if it suddenly ran over double the context window. How does llama-index know the context window of a random, super-new model going through LiteLLM?
No idea, I'm just looking at this function.

https://github.com/run-llama/llama_index/blob/3b892b3e82c140244d2b25a12ae35f91120a2264/llama_index/llms/litellm_utils.py#L111

Seems to be built into litellm

You've set the chunk size to 768 and the top k to 10. That means 7680 tokens are retrieved and inserted into the system prompt on every user message.
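A quick back-of-the-envelope sketch of that math (treating each retrieved chunk as roughly chunk_size tokens, which is an approximation):

Plain Text
# Approximate context inserted per user message: one chunk ~= chunk_size tokens
chunk_size = 768
similarity_top_k = 10
print(chunk_size * similarity_top_k)  # 7680 tokens, before the system prompt and chat history are counted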

By default, the memory is initialized based on the LLM's context window
https://github.com/run-llama/llama_index/blob/3b892b3e82c140244d2b25a12ae35f91120a2264/llama_index/chat_engine/context.py#L75

Just pass in a new memory that overrides this limit. Then it should be fine

Plain Text
from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=32000)
chat_engine = index.as_chat_engine(..., memory=memory)
(I think mixtral is 32k? Not sure)
@Logan M Ah ok, thanks, yeah that seems to be the issue:

Plain Text
try:
    context_size = int(litellm.get_max_tokens(modelname))
except Exception:
    context_size = 2048  # by default assume models have at least 2048 tokens

When I ran that directly I got:

Plain Text
Exception: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json


And Mixtral isn't in there yet. I'll just manually override it as you mentioned and give that a try, thanks!
LiteLLM seems to handle any model by just passing the model name through to the API, but things like the context window apparently need to be added manually.
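Putting the two snippets from the thread together, the final setup looks roughly like this (a sketch: index and reranker come from the earlier initialization code, and 32000 is an assumed token limit for Mixtral's 32k window):

Plain Text
from llama_index.memory import ChatMemoryBuffer

# Override the default memory, which would otherwise be sized from litellm's 2048-token fallback
memory = ChatMemoryBuffer.from_defaults(token_limit=32000)

chat_engine = index.as_chat_engine(
    chat_mode="context", retriever_mode="embedding", similarity_top_k=10,
    node_postprocessors=[reranker],
    verbose=True,
    system_prompt="",
    memory=memory,
)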