
QUESTION that arises from the examples in the ServiceContext docs (Key Components > Customization > ServiceContext) about kwargs:

#1
LLM(model=text-davinci-003, max_tokens=256)
SimpleNodeParser(chunk_size=1024, chunk_overlap=20)
PromptHelper(context_window=4096, num_output=256, chunk_overlap_ratio=0.1, chunk_size_limit=None)
no chunk size in ServiceContext

#2
LLM(model=gpt-3.5-turbo, max_tokens not defined)
SimpleNodeParser & PromptHelper not defined
ServiceContext(chunk_size=512)

The confusion:
  • Both models have the same max token window of 4096 (± 1 token), which is defined in #1 but not in #2, why?
  • #2 didn't define node parsing, but I guess ServiceContext(chunk_size=512) passes this over to the default node parser, which is like doing SimpleNodeParser(chunk_size=512, chunk_overlap=0), am I wrong?
  • Please help me understand the difference in #1 between LLM(max_tokens=256) & PromptHelper(num_output=256). The docs say "Number of outputs for the LLM" or "set number of output tokens" somewhere else, but I don't understand what this means in practice. Does this define the length of the final answer?
  • I already chunked the nodes and only use a saved index from disk; is the splitter in that phase meant for the user's input/question before embedding?
6 comments
On the latest version I now tried this, but I kinda feel like the "no idea what I'm doing" dog meme:

Plain Text
# imports assumed for the legacy (pre-0.8) llama_index + langchain API used here;
# exact module paths vary slightly between versions
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from llama_index import LLMPredictor, LangchainEmbedding, PromptHelper, ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

num_output = 512
chunk_size = 1024
chunk_overlap = 100

# LLM: 16K-context chat model, capped at num_output generated tokens
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.0, model_name='gpt-3.5-turbo-16k', max_tokens=num_output))
embed_model = LangchainEmbedding(OpenAIEmbeddings(model="text-embedding-ada-002"))

# ingestion side: splitter + node parser chop documents into nodes
splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=False, include_prev_next_rel=True)

# query side: prompt helper packs retrieved nodes into the 16K window, reserving num_output tokens
prompt_helper = PromptHelper(context_window=16000, num_output=num_output, chunk_overlap_ratio=0.1, chunk_size_limit=None)

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
    node_parser=node_parser,
    prompt_helper=prompt_helper)
oh we're getting into the weeds on this one haha
point 1 -> the context window is defined in #1 because it needs to know the max input size at query time, to make sure the prompts to the model are not too big. If you don't set it, under the hood it gets set automatically based on the model name (for OpenAI models anyway)
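
Roughly, in code (a minimal sketch against the legacy ServiceContext/PromptHelper API used in this thread; the variable names are just illustrative):

from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, PromptHelper, ServiceContext

llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo"))

# example #1 style: state the window explicitly via a PromptHelper
prompt_helper = PromptHelper(context_window=4096, num_output=256, chunk_overlap_ratio=0.1)
ctx_explicit = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

# example #2 style: pass no PromptHelper; from_defaults() builds one and, for known
# OpenAI model names, fills in the ~4K window from the model's metadata automatically
ctx_inferred = ServiceContext.from_defaults(llm_predictor=llm_predictor)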

point 2 -> the chunk size gets passed from the service context to the node parser, yes. But the node parser uses from_defaults(), which sets anything you don't pass to a default value. If you pass no chunk overlap, the default is 20
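
In other words (a rough sketch; the exact defaults depend on the llama_index version):

from llama_index import ServiceContext
from llama_index.node_parser import SimpleNodeParser

# passing chunk_size to the service context...
ctx = ServiceContext.from_defaults(chunk_size=512)

# ...ends up roughly equivalent to building the default node parser yourself,
# where the unset chunk_overlap falls back to the default (20), not 0
node_parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)
ctx_equivalent = ServiceContext.from_defaults(node_parser=node_parser)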

point 3 -> num_output in the prompt helper ensures that every prompt sent to the LLM leaves at least num_output tokens of room (i.e. the prompt is capped at 4096 - num_output tokens).

LLMs like GPT generate tokens one at a time, add the new token to the input, and generate the next token. It will keep generating tokens until either the max input size is hit, a special stop token is predicted, or the max_tokens is reached. If you set max_tokens on the LLM itself, it will generate up to that amount, but you also have to set num_output to leave room for that, unless you are ok with answers being cut off.

The default num_output is 256, while the default max_tokens is -1 (i.e. generate until it's done or hits the max size)
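
So if you want full-length answers, set both knobs to the same value, something like this sketch (same legacy API as above; 256 is just an example):

from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, PromptHelper, ServiceContext

num_output = 256

# max_tokens caps how many tokens the LLM is allowed to generate...
llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo", max_tokens=num_output))

# ...while num_output reserves that much room in every prompt (prompt <= 4096 - 256 tokens),
# so the generated answer isn't cut off by the context window
prompt_helper = PromptHelper(context_window=4096, num_output=num_output, chunk_overlap_ratio=0.1)

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)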

point 4 -> There are two splitters, as I'm sure you've noticed already. One during document ingestion (i.e. the node parser), and one during query time (i.e. the prompt helper). Node parser turns documents into nodes. Prompt helper fits prompt + nodes + query into a context window for an LLM call
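
Sketched end to end (assuming a service_context like the one earlier in the thread; the paths and query string are made up):

from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

service_context = ServiceContext.from_defaults()  # or the one built earlier in the thread

# ingestion: the node parser / text splitter chops documents into nodes
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
index.storage_context.persist(persist_dir="./storage")

# query time (e.g. after loading the saved index from disk): the prompt helper packs
# retrieved nodes + your question into the context window for each LLM call
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, service_context=service_context)
response = index.as_query_engine().query("What does the prompt helper do?")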
TLDR is, it's definitely confusing, but it also makes sense
Back to your example: it looks mostly fine (no need to set parameters to None though, they will just fall back to default values later)

If you are using a 16K window, you can take advantage of that by setting a large chunk size, OR setting a higher top k for a vector index.

The default chunk size (1024) is pretty good for generating good embeddings while keeping token costs low
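
For example (assuming an index and service_context built as in the sketch above; similarity_top_k controls how many chunks are retrieved per query):

from llama_index import ServiceContext

# option A: bigger chunks at ingestion time
big_chunk_ctx = ServiceContext.from_defaults(chunk_size=2048)

# option B: keep the default chunk size but retrieve more chunks per query,
# letting the 16K window absorb the larger prompt
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("your question here")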
Thanks Logan 🙂