gpt-3.5-turbo-0125 is available with a 16K context window, and comes with that nice price decrease
(I'm assuming this is in a context chat engine?)
Yeah, but that one isn't out yet, right?
(You might have to update llama-index)
I thought they said two weeks
Ahh, we were using gpt-3.5-turbo-1106 before, which has the same context window, so this problem will remain
Either way, is it possible to limit the number of tokens sent to OpenAI?
Are you using a context chat engine? Or what's the setup right now?
from llama_index.core import PromptTemplate, VectorStoreIndex
from llama_index.core.chat_engine.types import BaseChatEngine, ChatMode
from llama_index.core.llms import ChatMessage
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

def initialize_chat_engine(
    index: VectorStoreIndex,
    document_uuid: str,
    chat_history: list[ChatMessage],
) -> BaseChatEngine:
    # Restrict retrieval to chunks belonging to this document
    filters = MetadataFilters(
        filters=[ExactMatchFilter(key="doc_id", value=document_uuid)],
    )
    return index.as_chat_engine(
        chat_mode=ChatMode.CONTEXT,
        # CHAT_PROMPT_TEMPLATE is our own prompt constant, defined elsewhere
        condense_question_prompt=PromptTemplate(CHAT_PROMPT_TEMPLATE),
        chat_history=chat_history,
        agent_chat_response_mode="StreamingAgentChatResponse",
        similarity_top_k=20,
        filters=filters,
    )
(top_k = 20 is the problem)
Yeaaaaa
Tbh the only option is lowering the top k
My advice would be to keep the top k high, and maybe add reranking to filter it down?
Ok! We are working on a solution now to detect the very broad summary queries that forced us to use a top_k of 20 in the first place, and to add relevant document keywords to the query instead of using such a high top_k
Because, if I'm correct, a lower top_k with a more detailed query that includes the relevant keywords is probably better, right?
Maybe. Hard to say. Reranking is quite powerful as well
Is that something that's easy to implement?
As far as I understand it, top_k only refers to the number of chunks similar to the input query that get added to the OpenAI call in our case. If a user says something like "Summarize this for me", we will fetch chunks similar to that specific query, which might not be good, since the keywords of that query don't say which parts of the document are relevant
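(As a rough illustration of that at the retriever level, reusing the index and filters from the setup above; the keyword-enriched query string is just a made-up example:)

# The retriever only sees the raw query text, so a vague prompt fetches
# chunks that look like the words "Summarize this for me".
retriever = index.as_retriever(similarity_top_k=20, filters=filters)
vague_nodes = retriever.retrieve("Summarize this for me")

# Enriching the query with document keywords steers the similarity search
# toward the sections that actually matter (hypothetical keywords here).
enriched_nodes = retriever.retrieve(
    "Summarize this for me: pricing changes, context window, token limits"
)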
It's as easy as inputting node_postprocessors=[reranker]
in the kwargs for the query engine
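(Concretely, something like this; SentenceTransformerRerank is one built-in option, the model and top_n here are just example choices, and the same kwarg works when building the context chat engine:)

from llama_index.core.postprocessor import SentenceTransformerRerank

# Keep retrieval broad (top_k=20), then let a cross-encoder rerank and
# keep only the best 5 chunks before they are sent to the LLM.
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=5,
)

chat_engine = index.as_chat_engine(
    chat_mode=ChatMode.CONTEXT,
    similarity_top_k=20,
    node_postprocessors=[reranker],
    filters=filters,
)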
I agree. Something like this is what an agent or router is good for (i.e., do I route to a vector index for top-k retrieval, or to a summary index/custom query engine that fetches everything?)
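(A rough sketch of that router idea, assuming a vector_index and summary_index have been built elsewhere; the tool descriptions are made up:)

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# An LLM selector reads the tool descriptions and picks one engine per query,
# so broad summary requests stop hitting the top-k vector retriever.
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(similarity_top_k=5),
    description="Useful for specific questions about parts of the document.",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(response_mode="tree_summarize"),
    description="Useful for broad requests to summarize the whole document.",
)
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, summary_tool],
)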
Damn, so many options to choose from. We don't even use a query engine, but build a chat engine directly from a vector store index