you can set a token_limit on the memory, but you also need to be careful about the top-k on your retriever (too many retrieved chunks will also overflow the context window)
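e.g. a rough sketch of what I mean (assuming a plain VectorStoreIndex and the pre-0.10 llama_index imports that match your ServiceContext usage; swap in your own hybrid_retriever):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.memory import ChatMemoryBuffer

# hypothetical index just for illustration -- you'd keep your hybrid_retriever
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)

# keep top-k small so retrieved chunks + chat history fit in the model's window
retriever = index.as_retriever(similarity_top_k=3)

# cap the chat history itself at 4096 tokens
memory = ChatMemoryBuffer.from_defaults(token_limit=4096)
```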
```python
memory = ChatMemoryBuffer.from_defaults(token_limit=4096)

if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            llm = OpenAI(model="gpt-3.5-turbo")
            service_context = ServiceContext.from_defaults(llm=llm)
            query_engine = RetrieverQueryEngine.from_args(
                retriever=hybrid_retriever, service_context=service_context
            )
            chat_engine = CondensePlusContextChatEngine.from_defaults(
                query_engine, memory=memory, system_prompt=context_prompt
            )
            response, chat_messages = chat_engine.chat(str(prompt))
            if "not mentioned in" in response.response or "I don't know" in re…
```
so the token_limit of 4096 in memory = ChatMemoryBuffer.from_defaults(token_limit=4096) is there to make sure the previous Q&A history doesn't exceed 4096 tokens?
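if you want to see that trimming behaviour on its own, here's a quick sketch with made-up messages and a tiny limit (same pre-0.10 imports assumed): get() only returns the most recent messages that still fit under token_limit, so the oldest Q&A pairs get dropped first.

```python
from llama_index.llms import ChatMessage
from llama_index.memory import ChatMemoryBuffer

# tiny limit just to make the trimming visible
memory = ChatMemoryBuffer.from_defaults(token_limit=50)

memory.put(ChatMessage(role="user", content="first question " * 30))
memory.put(ChatMessage(role="assistant", content="first answer " * 30))
memory.put(ChatMessage(role="user", content="second question"))

# get() drops the older, oversized messages and only returns what fits
print(memory.get())
```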