Could anyone explain what a LlamaIndex Agent expects in an astream_chat() call when it receives tools to use?
Could anyone explain what a LlamaIndex Agent expects in an astream_chat() call when it receives tools to use? Specifically, in a non-streaming call, I send back a list of tools parsed from the LLM response in the request.tools list.

I'm trying to determine whether, on my backend, I should:
  1. Detect a tool call, stop yielding tokens to the agent until I encounter a stop token, then parse out the tools and send the final buffered request with request.tools included (roughly sketched below), or
  2. Continue streaming tokens to the agent and only send the request back with request.tools in the final response.
Which approach is correct?
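For concreteness, here is a rough sketch of what option 1 could look like on my side. This is not LlamaIndex's API; the chunk shape follows the OpenAI streaming format, and the "{" detection and field names are just assumptions.

Python
import json
from typing import Iterator


def relay_stream(deltas: Iterator[str]) -> Iterator[dict]:
    """Forward plain-text deltas; buffer once a tool call appears to start."""
    buffer = ""
    in_tool_call = False
    for delta in deltas:
        if not in_tool_call and delta.lstrip().startswith("{"):
            # Assume the model has started emitting a tool call.
            in_tool_call = True
        if in_tool_call:
            # Stop yielding to the agent; keep accumulating until the stream ends.
            buffer += delta
        else:
            # Plain text: forward the delta as an OpenAI-style chunk.
            yield {"choices": [{"delta": {"content": delta}, "finish_reason": None}]}

    if in_tool_call:
        # Parse the buffered tool call and emit one final chunk that
        # carries it in tool_calls (OpenAI streaming format).
        call = json.loads(buffer)
        yield {
            "choices": [{
                "delta": {
                    "content": None,
                    "tool_calls": [{
                        "index": 0,
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": call["name"],
                            "arguments": json.dumps(call.get("arguments", {})),
                        },
                    }],
                },
                "finish_reason": "tool_calls",
            }]
        }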
10 comments
Streaming with tool calls is tricky

The main requirement for agents is that they expect tool/function calls to be returned first

i.e. the first response dict from the LLM should have tool calls, if tools are being called
OK, so it sounds like I should detect a "{" in the first token and build up a buffer, with the first delta response being the full message and a list of the tools.
Pretty much (to be clear, it's expecting this to show up in the message.tool_calls part of the LLM response, if you're using the OpenAI format/client)
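For reference, a hand-written example of the shape being described, in the standard OpenAI format (the id, tool name, and arguments below are made up for illustration):

Python
# What the agent expects to find: the assistant message of the *first*
# response already carries tool_calls, rather than plain text content.
first_assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "generate_restaurant",      # hypothetical tool name
                "arguments": '{"city": "Boston"}',  # JSON-encoded string
            },
        }
    ],
}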
A quick way to check if it's working: we can stream tool calls

Python
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI
from pydantic.v1 import BaseModel
from typing import List
from IPython.display import clear_output
from pprint import pprint


class MenuItem(BaseModel):
    """A menu item in a restaurant."""

    course_name: str
    is_vegetarian: bool


class Restaurant(BaseModel):
    """A restaurant with name, city, and cuisine."""

    name: str
    city: str
    cuisine: str
    menu_items: List[MenuItem]


llm = OpenAI(model="gpt-3.5-turbo")

input_msg = ChatMessage.from_str("Generate a restaurant in Boston")

# Wrap the LLM so responses are parsed into Restaurant objects, then stream
# partial objects as they are built up from the model's output.
sllm = llm.as_structured_llm(Restaurant)
stream_output = sllm.stream_chat([input_msg])
for partial_output in stream_output:
    clear_output(wait=True)
    pprint(partial_output.raw.dict())
    restaurant_obj = partial_output.raw
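If tool-call streaming is wired up end to end, the partial Restaurant dict should fill in field by field as chunks arrive. As far as I understand, as_structured_llm on the OpenAI client is driven by function calling under the hood, so this exercises the same message.tool_calls path discussed above.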
Yeah, I'm trying to maintain parity with the ecosystem, so I'm using the OpenAI format everywhere. Hopefully it's the same format that things like Ollama and other open-source tools use.
I'd like my chat and agent endpoints to work with all the tools if I point them at my API, but I don't have a full grasp of everything yet. Right now my LlamaIndex agent is behind a /v1/chat/completion endpoint that proxies to my real completion endpoint in another Docker container at /v1/chat/completion. I don't know yet whether this should really be implementing the OpenAIAgent endpoint or not...
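As an aside, one way to point LlamaIndex at a custom OpenAI-compatible endpoint like that is the OpenAILike wrapper (this assumes the llama-index-llms-openai-like package; the model name, URL, and key below are placeholders):

Python
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="my-local-model",                # placeholder model name
    api_base="http://localhost:8000/v1",   # placeholder: your proxy's /v1 root
    api_key="unused",                      # placeholder key
    is_chat_model=True,
    is_function_calling_model=True,  # tells LlamaIndex it can return tool_calls
)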
My initial API completion endpoint proxies between my local Docker LLM and a version of that image deployed to RunPod, depending on what LLM is set in the model field.
When my Docker image boots up, it negotiates a PKI key with my server so I can have private communication into the memory of the RunPod host for some obfuscated privacy. So it's private as long as they're not digging my prompts out of memory... and I'm not that special, lol. New keys are negotiated for every chat after 60 seconds.
Got it working... Thanks again!