I am trying to use LlamaIndex's async chat engine functions

I am trying to use LlamaIndex's async chat engine functions, but they are blocking the thread... I am pretty sure this is not supposed to happen. Is this a bug, or am I using it wrong perhaps?
stream = await self.chat_engine.astream_chat(message_content, self.chat_history)
what kind of chat engine are you using?
llama-cpp server or locally?
there's no way for that to be non-blocking -- it's CPU-bound and in-process
For non-streaming I was using run_in_executor, but my streaming one needs to call an async callback, so I can't run it in an executor.
Plain Text
# blocking LLM message handler
def getmsg(message) -> AGENT_CHAT_RESPONSE_TYPE:
    return chat_sys.chat(message)


# async (non-blocking) LLM message handler: runs the blocking call in a thread pool
async def getmsga(message) -> AGENT_CHAT_RESPONSE_TYPE:
    return await client.loop.run_in_executor(None, getmsg, message)
but when I try to use the async functions provided by the chat engine they block my thread
oh that would be why then
My advice is to run the LLM on a local server. You really can't get async without making an API call somewhere
OK, so I should somehow run it as an "OpenAI-like" API server
Exactly.

Then I'm pretty sure you can just do

Plain Text
from llama_index.llms import OpenAILike

llm = OpenAILike(model="model", api_key="fake", api_base="http://127.0.0.1:8000/v1", ...)
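For reference, a rough, untested sketch of what the non-blocking streaming call could then look like -- the host/port and the is_chat_model flag are assumptions for a local llama-cpp-python server:

Plain Text
# untested sketch: stream a chat reply through the local OpenAI-compatible
# server; the request is plain HTTP, so the event loop is never blocked
from llama_index.llms import ChatMessage, OpenAILike

llm = OpenAILike(
    model="model",
    api_key="fake",
    api_base="http://127.0.0.1:8000/v1",  # assumed server address
    is_chat_model=True,
)

async def stream_reply(user_text: str) -> str:
    gen = await llm.astream_chat([ChatMessage(role="user", content=user_text)])
    reply = ""
    async for chunk in gen:
        reply += chunk.delta or ""  # each chunk carries the newly generated piece
    return reply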
ahh nice ok thanks
I did it like that, was working on this today, works great. Just having trouble setting up my prompt: my LLM keeps mixing user/assistant in its replies
What LLM are you using? You probably just need to configure completion_to_prompt and messages_to_prompt
I am running llama.cpp on a separate machine with the latest TinyLlama chat model GGUF; I am really new to all this
I would be really grateful for some examples / pointers on configuration you mention πŸ™‚
Plain Text
def completion_to_prompt(completion: str) -> str:
  # TinyLlama-chat uses the Zephyr-style template: <|system|>/<|user|>/<|assistant|> with </s> separators
  system_prompt = "..."
  return f"<|system|>\n{system_prompt}</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"

def messages_to_prompt(messages):
  prompt_str = ""
  for msg in messages:
    if msg.role == "system":
      prompt_str += f"<|system|>\n{msg.content}</s>\n"
    if msg.role == "user":
      prompt_str += f"<|user|>\n{msg.content}</s>\n"
    if msg.role == "assistant":
      prompt_str += f"<|assistant|>\n{msg.content}</s>\n"

  prompt_str += "<|assistant|>\n"
  return prompt_str


llm = OpenAILike(....., completion_to_prompt=completion_to_prompt, messages_to_prompt=messages_to_prompt)
something like that, a little untested
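If it helps, a quick way to sanity-check what the template above renders (my assumption of the Zephyr/TinyLlama-chat layout):

Plain Text
# prints the fully formatted prompt so you can eyeball the special tokens
print(completion_to_prompt("Hello!"))
# <|system|>
# ...</s>
# <|user|>
# Hello!</s>
# <|assistant|>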
I have had that happen too. Not sure what caused it. Maybe the memory window moving?
I changed my startup params for llama-cpp-python to:
python3 -m llama_cpp.server \
--model /root/Models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
--host 0.0.0.0 --port 5100 \
--interrupt_requests FALSE \
--n_threads 3 \
--n_gpu_layers 0 \
--chat_format chatml

and that seems to fix it (added the --chat_format)
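For what it's worth, with --chat_format chatml the server applies the chat template itself, so on the LlamaIndex side you can probably drop the custom prompt functions and just treat it as a chat model. An untested sketch (host/port copied from the command above, model name assumed -- check /v1/models):

Plain Text
from llama_index.llms import OpenAILike

# untested sketch: the server formats the prompt, so just send chat messages
llm = OpenAILike(
    model="tinyllama-1.1b-chat-v1.0",        # assumed name; check /v1/models
    api_key="fake",
    api_base="http://<server-ip>:5100/v1",   # matches --host/--port above
    is_chat_model=True,                      # use the /v1/chat/completions route
)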