model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_K_M.gguf"
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0,
    max_new_tokens=512,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={"top_k": 50, "top_p": 0.95},
    # kwargs to pass to __init__()
    # set n_gpu_layers to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 40},
    # transform inputs into Llama 2 chat format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
# create a service context
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=node_parser,
)
# set up query engine
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(
    streaming=False,
    text_qa_template=text_qa_template,
    similarity_top_k=3,
    response_mode="compact",
)
response = query_engine.query(question)
display_response(response)
Maybe try lowering the context_window a bit, to maybe 3700? The token counting might be a little inaccurate.
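For example, a minimal sketch of that tweak, assuming the same setup and helpers as above, only changes the context_window argument:

# Sketch: same LlamaCPP setup as above, with a lowered context_window
# to leave extra headroom for the prompt template and token-count drift.
llm = LlamaCPP(
    model_url=model_url,
    temperature=0,
    max_new_tokens=512,
    context_window=3700,  # was 3900; a bit more safety margin
    generate_kwargs={"top_k": 50, "top_p": 0.95},
    model_kwargs={"n_gpu_layers": 40},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)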