import torch
# import paths assume a recent llama-index install; older 0.9.x releases
# exposed these under `llama_index.prompts` / `llama_index.llms` instead
from llama_index.core import PromptTemplate
from llama_index.core.prompts.prompt_type import PromptType
from llama_index.llms.huggingface import HuggingFaceLLM

# Trying the huggingface Llama-2 chat model instead to see if inference speeds up!
DEFAULT_TEXT_QA_PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
query_wrapper_prompt = PromptTemplate(
    DEFAULT_TEXT_QA_PROMPT_TMPL,
    prompt_type=PromptType.QUESTION_ANSWER,
)

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.1, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 512},
    # float16 weights reduce memory usage when running on CUDA
    model_kwargs={"torch_dtype": torch.float16},
)
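A quick way to check whether this configuration actually speeds things up is to time a single completion before wiring the LLM into the query engine. A minimal sketch (the prompt text is arbitrary, and this assumes GPU access plus permission to the gated meta-llama checkpoint on the Hugging Face Hub):

# hypothetical smoke test: time one completion with the configured LLM
import time

start = time.time()
response = llm.complete("What is retrieval-augmented generation?")
print(response)
print(f"generation took {time.time() - start:.1f}s")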
The query_wrapper_prompt is only a wrapper around the entire internal prompt that llama-index constructs. It is provided so that you have an easy way to format the final prompt for a specific model.
For example, for Llama-2 it might look something like "[INST] {query_str} [/INST] ".
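For reference, a minimal sketch of what such a wrapper could look like for Llama-2 chat (the system-prompt wording here is purely illustrative, not a template that llama-index ships with):

# illustrative Llama-2 chat wrapper; {query_str} is replaced with the entire
# internal prompt that llama-index builds
LLAMA2_QUERY_WRAPPER = (
    "[INST] <<SYS>>\n"
    "You are a helpful assistant. Answer using only the information provided.\n"
    "<</SYS>>\n\n"
    "{query_str} [/INST] "
)
query_wrapper_prompt = PromptTemplate(LLAMA2_QUERY_WRAPPER)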
Thanks @Logan M! How do I include the {context_str} in this case? Since mine is a RAG Q&A use case, I want to provide the LLM with both context_str and query_str.