
Empty Response

At a glance

The community member is running LlamaIndex 0.8.62 in Google Colab and using LlamaCPP to load the LLama 2 13B chat model. After the first few successful query calls using VectorStoreIndex as the query engine, the responses become "Empty Response". The community member has tried using node postprocessing techniques like SentenceEmbeddingOptimizer and SentenceTransformerRerank, but the issue persists.

In the comments, another community member suggests trying to lower the context_window parameter to 3700, as the token counting might be inaccurate. Another community member notes that the issue was resolved by downgrading the llama-cpp-python version from 0.2.14 to 0.2.11.

There is no explicitly marked answer in the comments.

Hi, I am currently running LlamaIndex 0.8.62 in Google Colab. I used LlamaCPP to load Llama 2 13B chat (and other models in GGUF format). After the first couple of successful query calls using VectorStoreIndex as the query engine, every response I get after that is "Empty Response". I have also experimented both with and without node postprocessors (SentenceEmbeddingOptimizer and SentenceTransformerRerank), and the problem persists either way. How can I solve this?
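For context, this is roughly how those two postprocessors are wired in on LlamaIndex 0.8.x (a sketch only; index and embed_model refer to the objects built in the code posted further down, and the cutoff/top_n values are illustrative, not the exact ones used):

from llama_index.indices.postprocessor import (
    SentenceEmbeddingOptimizer,
    SentenceTransformerRerank,
)

# Attach the postprocessors to the query engine; the empty responses
# appear with or without them.
query_engine = index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[
        SentenceEmbeddingOptimizer(embed_model=embed_model, percentile_cutoff=0.5),
        SentenceTransformerRerank(top_n=2),
    ],
)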

P.S.: my temporary workaround is to check whether response == "Empty Response" and, if so, re-run query_engine.query(question), since it only seems to return "Empty Response" on the first attempt; a sketch of that loop follows.
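A minimal sketch of that retry workaround, assuming query_engine and question are the objects from the code posted below:

# Re-issue the query whenever the engine comes back with "Empty Response".
response = query_engine.query(question)
for _ in range(3):  # retry a few times rather than looping forever
    if str(response).strip() != "Empty Response":
        break
    response = query_engine.query(question)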
4 comments
@Logan M could you help me check? This is my code:

model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_K_M.gguf" llm = LlamaCPP( # You can pass in the URL to a GGML model to download it automatically model_url=model_url, # optionally, you can set the path to a pre-downloaded model instead of model_url model_path=None, temperature=0, max_new_tokens=512, # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room context_window=3900, # kwargs to pass to __call__() generate_kwargs={"top_k": 50, "top_p": 0.95}, # kwargs to pass to __init__() # set to at least 1 to use GPU model_kwargs={"n_gpu_layers": 40}, # transform inputs into Llama2 format messages_to_prompt=messages_to_prompt, completion_to_prompt=completion_to_prompt, verbose=True, ) embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5") node_parser = SimpleNodeParser.from_defaults(chunk_size=512) # create a service context service_context = ServiceContext.from_defaults( llm=llm, embed_model=embed_model, node_parser=node_parser ) # set up query engine index = VectorStoreIndex.from_documents(documents, service_context=service_context) query_engine = index.as_query_engine(streaming=None, text_qa_template=text_qa_template, similarity_top_k=3, response_mode = "compact",) response = query_engine.query(f"{question}") display_response(response)
I have no idea 😅 Maybe try lowering context_window a bit, to maybe 3700? The token counting might be a little inaccurate.
I guess it is because of the latest llama-cpp-python version, 0.2.14. I downgraded it to version 0.2.11 and it works again.
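For anyone hitting the same issue, that pin looks like this in a Colab cell (version number taken from this thread; add your usual build flags if you compile llama-cpp-python with GPU support):

!pip install llama-cpp-python==0.2.11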
Interesting 🤔