Hi everyone, if anyone can help me with this it would be great. I'm building the query engine like so: query_engine = index.as_query_engine(similarity_top_k=1, retriever_mode='embedding', response_mode='compact', text_qa_template=QA_PROMPT, service_context=service_context, verbose=True). Here I'm setting the response mode to compact, but the query engine is still using the create-and-refine method. Can anyone help, please? PS: The context is less than 200 tokens, so the context window is not fully used. (I mention this because the documentation says that if a chunk can't fit in the context window it falls back to the create-and-refine prompt method, but that isn't the case here.)
compact is just an extension of create+refine. The only difference is that it tries to stuff as much text from the retrieved nodes into each LLM call as possible
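Roughly, the difference looks like this (a conceptual sketch only, not LlamaIndex's actual code; the helper names `llm`, `qa_prompt`, `refine_prompt`, `count_tokens`, and `max_chunk_tokens` are made up for illustration):

```python
# Conceptual sketch of "refine" vs "compact" -- NOT the library's real implementation.

def refine_answer(question, node_texts):
    # create + refine: one LLM call per text chunk
    answer = llm(qa_prompt.format(context=node_texts[0], question=question))
    for text in node_texts[1:]:
        answer = llm(refine_prompt.format(existing=answer, context=text, question=question))
    return answer

def compact_answer(question, node_texts):
    # compact: first pack the retrieved node texts into as few prompts as possible,
    # then run the same create + refine loop over the packed chunks
    packed, current = [], ""
    for text in node_texts:
        if current and count_tokens(current + text) > max_chunk_tokens:
            packed.append(current)
            current = ""
        current += "\n" + text
    if current:
        packed.append(current)
    return refine_answer(question, packed)
```

So with similarity_top_k=1 and ~200 tokens of context, everything should fit into one packed call and the refine prompt shouldn't be needed.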
Is it still hitting the refine process (i.e. a second LLM call)? How do you know? How did you check the context size?
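One way to check is to attach a debug callback and count the LLM calls per query (a sketch assuming a 0.6+-style install where the callbacks module is available; imports may differ in your version):

```python
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, CBEventType, LlamaDebugHandler

# Attach a debug handler so we can count LLM calls per query
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])
service_context = ServiceContext.from_defaults(callback_manager=callback_manager)

query_engine = index.as_query_engine(
    similarity_top_k=1,
    response_mode="compact",
    service_context=service_context,
)
response = query_engine.query("your question here")

# One event pair per LLM call; more than one means the refine prompt was hit
llm_events = llama_debug.get_event_pairs(CBEventType.LLM)
print(f"LLM calls: {len(llm_events)}")
```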
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size=chunk_size_limit, num_output=num_outputs, context_window=context_window, chunk_overlap=chunk_overlap) — I tried this, but the embeddings did not get created.
Nothing wrong with using a prompt helper, but there are two chunk sizes in LlamaIndex: one at query time (the prompt helper) and one at data ingestion time (i.e. in the node parser).
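For example, both can be configured explicitly and passed into the service context (a sketch for the legacy ServiceContext API; exact parameter names vary by version, and the values here are just placeholders):

```python
from llama_index import PromptHelper, ServiceContext
from llama_index.node_parser import SimpleNodeParser

# Ingestion-time chunking: how documents are split into nodes
node_parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)

# Query-time packing: how much retrieved text is stuffed into each prompt
prompt_helper = PromptHelper(
    context_window=4096,
    num_output=256,
    chunk_overlap_ratio=0.1,
)

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,  # your existing predictor
    node_parser=node_parser,
    prompt_helper=prompt_helper,
)
```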
Running with default settings should not trigger the refine process (except in some edge cases with non-English languages or data that doesn't use many spaces).
If you want to lower the chunk size, though, you can do it in the service context (but only the chunk size can be set this way).
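Something like this (again a sketch; 512 is just an example value, and older versions name the argument chunk_size_limit instead of chunk_size):

```python
from llama_index import ServiceContext

# Lower only the ingestion chunk size via the service context
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,  # your existing predictor
    chunk_size=512,
)
```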