Chat engine

Hello,
I noticed the chat engine repeats a single .chat(question) call a few times before it answers the next question.
I have 3 questions, but it keeps looping the first one about 7 times before moving on to the next, which it doesn't loop.
So it's pretty weird.
Plain Text
response_text = chat_engine.chat(question)
The default chat engine is an agent, which, when using an open-source LLM, can be kind of unreliable.

How did you set up the chat engine? What LLM are you using?
I am using vLLM with Mistral 7B Instruct v0.1

Chat engine:
Plain Text
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "Answer the question as precisely as possible. Answer directly what is being asked."
    ),
)


LLM:
Plain Text
llm = OpenAILike(
    model=model,
    api_base=openai_api_base,
    api_key="fake",
    api_type="fake",
    max_tokens=512,
    temperature=0.1,
    query_wrapper_prompt=PromptTemplate,  # note: this passes the class itself rather than an instance
)
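(For reference, query_wrapper_prompt would normally be a PromptTemplate instance wrapping the model's instruction format; something like this, where the template string is just illustrative:)
Plain Text
from llama_index.prompts import PromptTemplate

# Illustrative only: wrap queries in Mistral-Instruct's [INST] format
query_wrapper_prompt = PromptTemplate("[INST] {query_str} [/INST]")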
Hmm, that's pretty weird.

There will be one call to embed and retrieve context using your embedding model

Then an LLM call to respond using the context + chat history
I find it hard to believe that code would call the LLM more than once.

The source code makes a single LLM call
Now, constructing the index itself would take many embedding model calls
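To illustrate, per .chat() call the context chat engine does roughly this (a simplified sketch of the control flow, not the actual source):
Plain Text
# Simplified sketch of what chat_mode="context" does per .chat() call.
def context_chat(message, chat_history, retriever, llm, system_prompt):
    # 1) one embedding + retrieval call for the incoming message
    nodes = retriever.retrieve(message)
    context = "\n\n".join(node.get_content() for node in nodes)

    # 2) one LLM call with system prompt + retrieved context + history + message
    system = f"{system_prompt}\n\nContext information:\n{context}"
    messages = [("system", system), *chat_history, ("user", message)]
    response = llm.chat(messages)

    # history is appended once; nothing here re-issues the same question
    chat_history.extend([("user", message), ("assistant", str(response))])
    return response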
So what are you saying?
If the code would make a single LLM call, do you think the index could influence it?
I think I haven't updated llamaindex in a while, so I could try that too.
But here is more of my code, including the index:
Plain Text
embed_model_name = 'BAAI/bge-small-en-v1.5'
embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name,
    device='cuda',
    normalize=True,
)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
    storage_context=storage_context,
)

from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

I even tried
Plain Text
chat_engine.reset()
before prompting the first question, because I noticed that if I don't call reset after a while it gets buggy, but starting with it didn't change anything (I do end with a reset after the 3 questions to keep it bug-free).
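For clarity, the surrounding loop is basically this (a minimal sketch; questions stands in for my 3 questions):
Plain Text
questions = [
    "First question?",
    "Second question?",
    "Third question?",
]

chat_engine.reset()  # start from a clean chat history

for question in questions:
    response_text = chat_engine.chat(question)
    print(response_text)

chat_engine.reset()  # clear memory again after the batch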
Hmm, it might actually be a vLLM issue now that I look at it. I'm going to dig a bit deeper, because I notice a new line appears every 5 seconds, and it seems linked to the repeating.
I also noticed another question might do the same thing, giving back copies of the output once it's done.
Attachment: image.png
Either that, or the chat engine keeps the connection to vLLM open, repeating the same question and then returning it all bundled up once it's satisfied?
But I kind of think what you said about open-source LLMs could also be the case, because when it happens it leaks the page of the PDF etc., while I didn't ask for that and it isn't included in the output when this doesn't happen.
I asked in the vLLM community; they agree it might be a model issue.
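If it is model-side repetition, one thing I might try is forwarding stop strings and a frequency penalty through the OpenAI-compatible API (a sketch; the stop strings are my assumption for Mistral-Instruct, and which parameters the vLLM server accepts depends on its version):
Plain Text
llm = OpenAILike(
    model=model,
    api_base=openai_api_base,
    api_key="fake",
    max_tokens=512,
    temperature=0.1,
    additional_kwargs={
        "stop": ["</s>", "[INST]"],   # assumed Mistral-Instruct stop strings
        "frequency_penalty": 0.5,     # discourage verbatim repetition
    },
)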