
Is there a way to make sure that LlamaIndex does not send a prompt to OpenAI that is bigger than the context window allowed for a specific LLM?

For example, we have some queries with a high similarity_top_k, which results in a lot of data being sent as context. This works fine for GPT-4 but exceeds the gpt-3.5-turbo context window.
gpt-3.5-turbo-0125 is available with a 16K context window, and comes with that nice price decrease

(I'm assuming this is in a context chat engine?)
Yeah but for now that one is not out yet right?
It's out actually
(You might have to update llama-index)
I thought they said two weeks
Okay, that's promising
Ahh, we were using gpt-3.5-turbo-1106 before, which has the same context window, so this problem will remain
Either way, is it possible to limit the number of tokens sent to OpenAI?
Are you using a context chat engine? Or what's the setup right now?
# Imports assume the llama-index 0.10+ package layout; older versions
# import these from the top-level llama_index package instead.
from llama_index.core import VectorStoreIndex
from llama_index.core.chat_engine.types import BaseChatEngine, ChatMode
from llama_index.core.prompts import PromptTemplate
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# chat_history and CHAT_PROMPT_TEMPLATE are defined elsewhere in our code
def initialize_chat_engine(index: VectorStoreIndex, document_uuid: str) -> BaseChatEngine:
    # only retrieve chunks that belong to the requested document
    filters = MetadataFilters(
        filters=[ExactMatchFilter(key="doc_id", value=document_uuid)],
    )

    return index.as_chat_engine(
        chat_mode=ChatMode.CONTEXT,
        condense_question_prompt=PromptTemplate(CHAT_PROMPT_TEMPLATE),
        chat_history=chat_history,
        agent_chat_response_mode="StreamingAgentChatResponse",
        similarity_top_k=20,
        filters=filters,
    )
(top_k = 20 is the problem)
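(Aside, not from the thread: on the earlier question of limiting the tokens sent to OpenAI, I'm not aware of a single flag for that in this setup, but a custom node postprocessor can drop retrieved chunks once a rough token budget is used up and can be passed wherever node_postprocessors is accepted. A minimal sketch, assuming llama-index 0.10-style imports and tiktoken for counting; the class name and the 3000-token budget are illustrative only:)

from typing import List, Optional

import tiktoken
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle

class TokenBudgetPostprocessor(BaseNodePostprocessor):
    """Drops retrieved chunks once a rough token budget is used up."""

    max_tokens: int = 3000  # illustrative budget; tune for the target model

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        enc = tiktoken.get_encoding("cl100k_base")
        kept, used = [], 0
        for node_with_score in nodes:
            n_tokens = len(enc.encode(node_with_score.node.get_content()))
            if used + n_tokens > self.max_tokens:
                break
            kept.append(node_with_score)
            used += n_tokens
        return kept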
Yeaaaaa

Tbh the only option is lowering the top k

My advice would be to keep the top k high, and maybe add reranking to filter it down?
Ok! We're now working on a solution that detects the very broad summary queries that forced us to use a top_k of 20 in the first place, and adds relevant document keywords to the query instead of relying on such a high top_k πŸ™‚
Because, if I'm correct, a lower top_k combined with a more detailed query that includes the relevant keywords is probably better, right?
Maybe. Hard to say. Reranking is quite powerful as well
Is that something that's easy to implement?
As far as I understand it, top_k only refers to the number of chunks similar to the input query that get added to the OpenAI call in our case. If a user says something like "Summarize this for me", we fetch chunks similar to that specific query, which might not be good, since the keywords of that query don't indicate which parts of the document are actually relevant.
It's as easy as passing node_postprocessors=[reranker] in the kwargs for the query engine
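(For reference, a minimal sketch of that, assuming a sentence-transformers cross-encoder reranker and llama-index 0.10-style imports; the model name and top_n=5 are illustrative, not from the thread:)

from llama_index.core.postprocessor import SentenceTransformerRerank

# retrieve broadly (top_k=20), then keep only the 5 best-matching chunks
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=5,
)

query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker],
)

(Whether as_chat_engine with the context chat mode accepts the same node_postprocessors kwarg depends on the llama-index version.)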
I agree. Something like this is what an agent or router is good for (i.e., do I route to a vector index for top-k retrieval, or to a summary index / custom query engine that fetches everything?)
Damn, so many options to choose from. We don't even use a query engine; we build a chat engine directly from a vector store index
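(And a rough sketch of the router idea above, assuming llama-index 0.10-style imports; the variable names and tool descriptions are illustrative, not from the thread. Broad "summarize this" questions get routed to a summary index, while specific questions go to the vector index:)

from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# `nodes` = the parsed chunks of the document (assumed to exist already);
# both indices are built over the same content
vector_index = VectorStoreIndex(nodes)
summary_index = SummaryIndex(nodes)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(similarity_top_k=5),
    description="Answers specific questions about parts of the document.",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(response_mode="tree_summarize"),
    description="Summarizes the entire document for broad questions.",
)

# the LLM picks the tool whose description best matches the user query
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, summary_tool],
)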