Thanks Logan (and @Teemu _), here's how I'm defining my chat engine:
chat_engine = index.as_chat_engine(
    chat_mode="context",
    retriever_mode="embedding",
    similarity_top_k=5,
    node_postprocessors=[reranker],
    verbose=True,
    system_prompt=" ".join((DM_Prompt, History_Prompt, Character_Prompt)),
    memory=memory,
)
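For completeness, the reranker referenced above is a standard sentence-transformers cross-encoder reranker, roughly like this (I'm paraphrasing from my script, so the exact model string and top_n may differ):

from llama_index.core.postprocessor import SentenceTransformerRerank

# Sketch of the reranker passed to node_postprocessors; values are approximate.
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",  # cross-encoder model (approximate)
    top_n=3,  # keep the best 3 of the 5 retrieved nodes
)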
and I'm configuring the global Settings like this:
Settings.llm = llm
Settings.chunk_size = 512
Settings.callback_manager = CallbackManager([token_counter])
Settings.chunk_overlap = 25
Settings.embed_model = embed_model
Settings.num_output = 512
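For reference, the token_counter in the callback manager is just the standard TokenCountingHandler, set up roughly like this (the exact tokenizer model string may differ in my script):

import tiktoken
from llama_index.core.callbacks import TokenCountingHandler

# Counts prompt, completion, and embedding tokens via callbacks (rough sketch).
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)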
I increased "num_output" from the default of 256 because that seemed to be what caused the overrun before: the prompt was just under the GPT-3.5 context limit, and the response ran longer than the 256 tokens reserved for it, pushing the total over.
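To see how close each turn actually gets to the limit, I can read the counter after a chat call, something like this (attribute names taken from TokenCountingHandler, so apologies if I have them slightly off):

# After a chat turn, report how the context budget was spent.
response = chat_engine.chat("What happened in the last session?")
print("prompt tokens:    ", token_counter.prompt_llm_token_count)
print("completion tokens:", token_counter.completion_llm_token_count)
print("total LLM tokens: ", token_counter.total_llm_token_count)
token_counter.reset_counts()  # reset so the next turn's numbers are clean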
I am seeing messages like this sometimes:
Query has been truncated from the right to 256 tokens from 1271 tokens.
so it does seem to be doing some truncation to keep things under a limit, but I'm not sure exactly what is being clipped or where that 256-token cap comes from.
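To narrow down what's being clipped, I'm planning to cap and inspect the chat memory explicitly, along these lines (assuming ChatMemoryBuffer is the right type for my memory object, and that get() vs. get_all() shows the truncated vs. full history):

from llama_index.core.memory import ChatMemoryBuffer

# Explicit cap on chat-history tokens; 3000 is a guess, not a tuned value.
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

# After a few turns: compare what is stored vs. what actually gets sent to the LLM.
print(len(memory.get_all()), "messages stored in total")
print(len(memory.get()), "messages that fit within token_limit")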