Chat engine

Hello,
I noticed the chat engine repeats a single .chat(question) call a few times before it answers the next question.
I have 3 questions, but it keeps looping the first one about 7 times before moving on to the next, which it doesn't loop.
So it's pretty weird.
Plain Text
response_text = chat_engine.chat(question)
The default chat engine is an agent, which, when using an open-source LLM, can be kind of unreliable.

How did you set up the chat engine? What LLM are you using?
I am using vLLM with Mistral 7B Instruct v0.1

Chat engine:
Plain Text
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "Answer the question as precisely as possible. Answer directly what is being asked."
    ),
)


LLM:
Plain Text
llm = OpenAILike(
    model=model,
    api_base=openai_api_base,
    api_key="fake",
    api_type="fake",
    max_tokens=512,
    temperature=0.1,
    query_wrapper_prompt=PromptTemplate,  # note: this passes the class itself rather than an instance
)
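(For reference, query_wrapper_prompt would normally be a PromptTemplate instance wrapping the model's instruction format; something like this, where the template string is just illustrative:)
Plain Text
from llama_index.prompts import PromptTemplate

# Illustrative only: wrap queries in Mistral-Instruct's [INST] format
query_wrapper_prompt = PromptTemplate("[INST] {query_str} [/INST]")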
Hmm, that's pretty weird.

There will be one call to embed and retrieve context using your embedding model

Then an LLM call to respond using the context + chat history
I find it hard to believe that code would call the LLM more than once.

The source code makes a single LLM call
Now, constructing the index itself would take many embedding model calls
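To illustrate, per .chat() call the context chat engine does roughly this (a simplified sketch of the control flow, not the actual source):
Plain Text
# Simplified sketch of what chat_mode="context" does per .chat() call.
def context_chat(message, chat_history, retriever, llm, system_prompt):
    # 1) one embedding + retrieval call for the incoming message
    nodes = retriever.retrieve(message)
    context = "\n\n".join(node.get_content() for node in nodes)

    # 2) one LLM call with system prompt + retrieved context + history + message
    system = f"{system_prompt}\n\nContext information:\n{context}"
    messages = [("system", system), *chat_history, ("user", message)]
    response = llm.chat(messages)

    # history is appended once; nothing here re-issues the same question
    chat_history.extend([("user", message), ("assistant", str(response))])
    return response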
So what are you saying?
If the code would make a single LLM call, do you think the index could influence it?
I think I haven't updated llamaindex in a while, so I could try that too.
But here is more of my code, including the index:
Plain Text
embed_model_name = 'BAAI/bge-small-en-v1.5'
embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name,
    device='cuda',
    normalize=True,
)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
    storage_context=storage_context,
)

from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

I even tried
Plain Text
chat_engine.reset()
before prompting the first question, because I noticed that if I don't call reset after a while it gets buggy, but starting with it didn't change anything (I do end with a reset after the 3 questions to keep it bug-free).
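For clarity, the surrounding loop is basically this (a minimal sketch; questions stands in for my 3 questions):
Plain Text
questions = [
    "First question?",
    "Second question?",
    "Third question?",
]

chat_engine.reset()  # start from a clean chat history

for question in questions:
    response_text = chat_engine.chat(question)
    print(response_text)

chat_engine.reset()  # clear memory again after the batch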
Hmm, it might actually be a vLLM issue now that I look at it. I'm going to dig a bit deeper, because I notice a new line appears every 5 seconds, and it seems linked to the repeating.
I also noticed another question might do the same thing, giving back copies of the output once it's done.
Attachment: image.png
Either that, or the chat engine keeps the connection to vLLM open, repeating the same question and then returning it all bundled up once it's satisfied?
But I kind of think what you said about open-source LLMs could also be the case, because when it happens it leaks the page of the PDF etc., while I didn't ask for that and it isn't included in the output when this doesn't happen.
I asked in the vLLM community; they agree it might be a model issue.
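If it is model-side repetition, one thing I might try is forwarding stop strings and a frequency penalty through the OpenAI-compatible API (a sketch; the stop strings are my assumption for Mistral-Instruct, and which parameters the vLLM server accepts depends on its version):
Plain Text
llm = OpenAILike(
    model=model,
    api_base=openai_api_base,
    api_key="fake",
    max_tokens=512,
    temperature=0.1,
    additional_kwargs={
        "stop": ["</s>", "[INST]"],   # assumed Mistral-Instruct stop strings
        "frequency_penalty": 0.5,     # discourage verbatim repetition
    },
)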