I am trying to use LlamaIndex's async chat engine functions

I am trying to use LlamaIndex's async chat engine functions, but they are blocking the thread... I am pretty sure this is not supposed to happen. Is this a bug, or am I using it wrong perhaps?
stream = await self.chat_engine.astream_chat(message_content, self.chat_history)
what kind of chat engine are you using?
llama-cpp server or locally?
there's no way for that to be non-blocking -- it's CPU-bound and in-process
For non-streaming I was using run_in_executor, but my streaming one needs to call an async callback, so I can't run it in an executor.
Plain Text
# blocking LLM message handler
def getmsg(message) -> AGENT_CHAT_RESPONSE_TYPE:
    return chat_sys.chat(message)


# async (non-blocking) LLM message handler: runs the blocking call in a thread pool
async def getmsga(message) -> AGENT_CHAT_RESPONSE_TYPE:
    return await client.loop.run_in_executor(None, getmsg, message)
but when I try to use the async functions provided by the chat engine they block my thread
oh that would be why then
My advice is to run the LLM on a local server. You really can't get async without making an API call somewhere
OK, so I should somehow run it as an "OpenAI-like" API server
Exactly.

Then I'm pretty sure you can just do

Plain Text
from llama_index.llms import OpenAILike

llm = OpenAILike(model="model", api_key="fake", api_base="http://127.0.0.1:8000/v1", ...)
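For reference, a rough, untested sketch of what the non-blocking streaming call could then look like -- the host/port and the is_chat_model flag are assumptions for a local llama-cpp-python server:

Plain Text
# untested sketch: stream a chat reply through the local OpenAI-compatible
# server; the request is plain HTTP, so the event loop is never blocked
from llama_index.llms import ChatMessage, OpenAILike

llm = OpenAILike(
    model="model",
    api_key="fake",
    api_base="http://127.0.0.1:8000/v1",  # assumed server address
    is_chat_model=True,
)

async def stream_reply(user_text: str) -> str:
    gen = await llm.astream_chat([ChatMessage(role="user", content=user_text)])
    reply = ""
    async for chunk in gen:
        reply += chunk.delta or ""  # each chunk carries the newly generated piece
    return reply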
ahh nice ok thanks
I did it like that, was working on this today, works great. Just having trouble setting up my prompt: my LLM keeps mixing user/assistant in its replies
What LLM are you using? You probably just need to configure completion_to_prompt and messages_to_prompt
I am running llama.cpp on a separate machine with the latest TinyLlama chat model GGUF; I am really new to all this
I would be really grateful for some examples / pointers on configuration you mention πŸ™‚
Plain Text
def completion_to_prompt(completion: str) -> str:
  # TinyLlama-chat uses the Zephyr-style template: <|system|>/<|user|>/<|assistant|> with </s> separators
  system_prompt = "..."
  return f"<|system|>\n{system_prompt}</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"

def messages_to_prompt(messages):
  prompt_str = ""
  for msg in messages:
    if msg.role == "system":
      prompt_str += f"<|system|>\n{msg.content}</s>\n"
    if msg.role == "user":
      prompt_str += f"<|user|>\n{msg.content}</s>\n"
    if msg.role == "assistant":
      prompt_str += f"<|assistant|>\n{msg.content}</s>\n"

  prompt_str += "<|assistant|>\n"
  return prompt_str


llm = OpenAILike(....., completion_to_prompt=completion_to_prompt, messages_to_prompt=messages_to_prompt)
something like that, a little untested
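If it helps, a quick way to sanity-check what the template above renders (my assumption of the Zephyr/TinyLlama-chat layout):

Plain Text
# prints the fully formatted prompt so you can eyeball the special tokens
print(completion_to_prompt("Hello!"))
# <|system|>
# ...</s>
# <|user|>
# Hello!</s>
# <|assistant|>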
I have had that happen too. Not sure what caused it. Maybe the memory window moving?
I changed my startup params for llama-cpp-python to:
python3 -m llama_cpp.server \
--model /root/Models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
--host 0.0.0.0 --port 5100 \
--interrupt_requests FALSE \
--n_threads 3 \
--n_gpu_layers 0 \
--chat_format chatml

and that seems to fix it (added the --chat_format)
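For what it's worth, with --chat_format chatml the server applies the chat template itself, so on the LlamaIndex side you can probably drop the custom prompt functions and just treat it as a chat model. An untested sketch (host/port copied from the command above, model name assumed -- check /v1/models):

Plain Text
from llama_index.llms import OpenAILike

# untested sketch: the server formats the prompt, so just send chat messages
llm = OpenAILike(
    model="tinyllama-1.1b-chat-v1.0",        # assumed name; check /v1/models
    api_key="fake",
    api_base="http://<server-ip>:5100/v1",   # matches --host/--port above
    is_chat_model=True,                      # use the /v1/chat/completions route
)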