Hi all, can people who have experience with LlamaIndex help me figure out how you guys decide on the "chunk_size" and "chunk_overlap" field values? And similarly, the "context_window" and "num_output" fields in prompt_helper?
The defaults for chunk size and chunk overlap (1024 and 20) are pretty good; I wouldn't touch them tbh unless you had a good reason to.

context window gets set automatically if you are using an openai LLM. If not, it's just the max input size to your LLM

num_output is how many tokens the LLM will output. By default, openai has a max of 256. You can read more on how to change this here: https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-explicitly-configure-context-window-and-num-output
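For reference, here's a minimal sketch of what that configuration looked like with the ServiceContext API the linked docs describe (import paths and defaults have since moved in newer LlamaIndex releases, so treat the exact names as illustrative rather than canonical):

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI

# Total tokens the LLM can handle (prompt + generated output).
context_window = 4096
# Tokens reserved for the LLM's answer.
num_output = 256

# max_tokens on the LLM should match num_output so the model is actually
# allowed to use the space reserved for it.
llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=num_output)

service_context = ServiceContext.from_defaults(
    llm=llm,
    context_window=context_window,
    num_output=num_output,
    chunk_size=1024,   # the default chunk size mentioned above
    chunk_overlap=20,  # the default chunk overlap mentioned above
)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```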
Thanks! A few follow-up questions: you said "I wouldn't touch them tbh unless you had a good reason to" - what are examples of some good reasons, Logan?
I know how to change num_output but I don't understand what it signifies. The documentation says: "The number of maximum output from the LLM. Typically we set this automatically given the model metadata. This parameter does not actually limit the model output, it affects the amount of “space” we save for the output, when computing available context window size for packing text from retrieved Nodes." Why are we saving space for the output? How can the output from the LLM for a query affect the context window - doesn't the context window mean setting the space for the past records?
If you changed to some custom LLM that has a smaller/larger context window, then this likely needs to be set manually for things to work smoothly. I can't really think of a good reason to change chunk overlap besides just experimenting lol
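As an example of that "custom LLM" case, here's a hedged sketch using the HuggingFaceLLM wrapper (the model name is just a placeholder). The point is that for a non-OpenAI model you declare the context window and output budget yourself instead of having them inferred from model metadata:

```python
from llama_index import ServiceContext
from llama_index.llms import HuggingFaceLLM

# Placeholder local model -- swap in whatever you're actually running.
llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    context_window=2048,  # the model's real max input size
    max_new_tokens=256,   # how many tokens it may generate per call
)

service_context = ServiceContext.from_defaults(llm=llm)
```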
So here's a longer explanation:
LLMs are almost entirely decoder models. What this means is that they generate one token at a time, add that token to the input, and generate the next

This means that if we want to generate at most 256 tokens, then we have to leave room in the input for 256 tokens.

The only models that break this pattern are Google's FLAN-T5 models, which are encoder-decoder instead of decoder-only. In that case, the input and output are completely disconnected (although this architecture doesn't seem to be catching on for LLMs -- it's very resource heavy)
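Put another way, the budget for packing retrieved text is roughly the context window minus the reserved output and the prompt/query text. A back-of-the-envelope sketch (the prompt_overhead number is made up, and the real PromptHelper accounting is more detailed):

```python
# Rough token-budget arithmetic, illustrative numbers only.
context_window = 4096   # total tokens the model can see, input + output combined
num_output = 256        # tokens reserved so the answer has room to be generated
prompt_overhead = 200   # hypothetical: prompt template + the query itself

available_for_nodes = context_window - num_output - prompt_overhead
print(available_for_nodes)          # 3640 tokens left for retrieved text
print(available_for_nodes // 1024)  # ~3 full default-sized chunks per LLM call
```

So the bigger num_output is, the less room each call has for retrieved chunks -- which is why the space gets reserved up front even though it doesn't cap the model's actual output.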
Wow! Got it! I had no idea!
You are god sent Logan!
Happy to help! 💪 :dotsCATJAM: