Nvm nvm it’s not working 😕

it should work in the service context?

What's the issue here?
Depending on the context window size, I would also try reducing the chunk size

Python
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import LangChainLLM

llm = LangChainLLM(lc_llm)
service_context = ServiceContext.from_defaults(llm=llm, context_window=2048, chunk_size=512)

set_global_service_context(service_context)
Maybe it’s me, but I’m using a custom Prompt and the beginning (the system message) is getting cut off. Using the compact and refine synthesizer
If I have a 4096-token context window (llama-2) and max output tokens of 1000, should I set context_window to 4096 or 3096?
Actually hold up, how is llama-index taking the context window into account? I’m using text-generation-inference to host my LLM, so where is it getting the number of tokens to chunk appropriately?
Still use 4096. Llama index should see that you have max tokens set to 1000 and figure it out.

Weird that the start of the prompt is getting cut off though; I would expect it to be the end?

System prompts are a little janky though, still figuring out the best way to integrate them

A simple fix might be to slightly decrease the context window to take into account the system prompt
I don’t have a prompt helper configured, so I’m still confused as to how it’s calculating this?
It picks up the data from the llm itself
The prompt helper isn't really user-facing anymore
And you can also set context_window and num_output directly in the service context
Maybe just to be sure lol
Sorry, can you elaborate on how it is using text-generation-inference to get the number of tokens?
TGI doesn’t have a tokenizer endpoint; they expect chunking to happen client-side
So specifically, for LangChain LLMs, it's defaulting to context_window=3900 (which allows for some wiggle room) and num_output=256

If any of these are incorrect, then they need to be adjusted in the service context to match the LLM

Python
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import LangChainLLM

llm = LangChainLLM(lc_llm)
service_context = ServiceContext.from_defaults(llm=llm, context_window=3900, num_output=1000)

set_global_service_context(service_context)


For other officially supported LLMs, these numbers can be pulled from the LLM class directly. LangChainLLM is a special case.
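
As a minimal sketch of that (assuming one of the built-in LLM classes, e.g. OpenAI; the model name is just an example), the values can be read off the LLM's metadata:

Python
from llama_index.llms import OpenAI

# An officially supported LLM class; the model name is just an example
llm = OpenAI(model="gpt-3.5-turbo")

# LLMMetadata holds the values llama-index uses for chunking and prompt budgeting
print(llm.metadata.context_window)
print(llm.metadata.num_output)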


Then, using these values, llama-index chunks things appropriately. The one snag is the system prompt, which is not accounted for properly when constructing LLM inputs (long story). So if your system prompt is causing issues, try shortening the context window in the service context
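
For example, one rough way to do that (the 500-token budget is a placeholder, size it to your actual system prompt, and lc_llm is your existing LangChain LLM):

Python
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import LangChainLLM

# Hypothetical budget: reserve room for the system prompt plus template overhead
SYSTEM_PROMPT_BUDGET = 500

llm = LangChainLLM(lc_llm)  # lc_llm: your existing LangChain LLM
service_context = ServiceContext.from_defaults(
    llm=llm,
    context_window=4096 - SYSTEM_PROMPT_BUDGET,
    num_output=1000,
)
set_global_service_context(service_context)
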
Right, but without access to the correct tokenizer it's impossible to tell how to chunk things appropriately
Sorry if I’m being dense, but how does it know/access the llama-2 tokenizer to count up to 4096 tokens?
Right, it doesn't; it will be an approximation. By default it's using a gpt2 tokenizer to calculate this.

Most tokenizers are fairly close to each other though tbh
Hence, setting a lower context_window is usually advised 🙂
(which is also why the default is 3900)
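
To see how big that approximation gap actually is, here is a quick sketch comparing the default gpt2 encoding against a real llama-2 tokenizer (the HF checkpoint name is just an example and is gated; any llama-2 tokenizer you have access to works):

Python
import tiktoken
from transformers import AutoTokenizer

text = "Some representative chunk of your documents ..."

# Default approximation: gpt2 BPE via tiktoken (what llama-index uses here)
gpt2_count = len(tiktoken.get_encoding("gpt2").encode(text))

# Exact count with a llama-2 tokenizer (example checkpoint; requires access)
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_count = len(llama_tokenizer.encode(text))

print(gpt2_count, llama_count)  # the slight mismatch is why the 3900 default leaves headroom
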
Ah got it, thanks