Hmm, I'm not even sure the OpenAIAssistant
is set up to work with modalities beyond text
I'm also not sure if the intermediate steps are exposed?
Disclaimer: I've also never used the OpenAI assistant stuff myself, I'm just really familiar with the codebase
The intermediate steps, and pulling images, are exposed in the OpenAI API. Maybe it's not possible via the LlamaIndex wrapper?
I thiiiiink response.sources
may have what you are looking for
At least, that's where the tool calls and their outputs get placed
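Something like this is what I have in mind (rough sketch; assumes agent is an OpenAIAssistantAgent you've already built, and that sources holds ToolOutput objects like it does for the other agents):

response = agent.chat("make a bar chart of these numbers: 1, 2, 3")
for source in response.sources:
    # each source should be a ToolOutput carrying the tool name and raw output
    print(source.tool_name, source.raw_output)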
Thanks, I will take a look, appreciate the pointer. I'm done for the day so it'll be tomorrow
you can also do agent.chat_history
to get the chat history, but I'm not 100% sure what that will look like (it's pulled from the OpenAI API though)
probably also has some good details
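e.g. (rough sketch; I'm not sure what non-text content will look like here, it may only show up in additional_kwargs):

for message in agent.chat_history:
    print(message.role, message.content, message.additional_kwargs)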
OpenAI Assistant can respond with images. I think it is a LlamaIndex agent issue. Also trying to figure this out.
The two options mentioned above don't help? I really felt like that response.sources
would have it, since that contains the output of every tool call used when creating the response
result.sources is an empty list, so that doesn't seem to work. result.response is a string starting with "The chart above...". That's the LLM's response; it thinks it made a chart (and I believe it did too)
agent.chat_history has two messages in it. They look like the prompt I used and the text response. The messages, as well as the agent itself, have some extra properties like "files_dict" or "file_ids", but those are empty dicts
ah I see... looking at the code, sources only gets filled in for local tools... I think?
If you are up for it, would love to see a PR to fix this somehow
The chainlit example retrieves the steps from the run and iterates over them, while the LlamaIndex code doesn't do that. Is it possible that the code needs to iterate over the steps to capture the intermediate info, including images, intermediate outputs, etc.?
I don't know the OpenAI api well enough to know for sure.
maybe steps aren't necessary except to show intermediate progress?
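For reference, the chainlit example does roughly this with the raw client (sketch from memory; exact method signatures may differ by SDK version, and thread_id/run_id are whatever the run created):

run_steps = client.beta.threads.runs.steps.list(
    thread_id=thread_id, run_id=run_id
)
for step in run_steps.data:
    # step_details is either a message_creation or a tool_calls object;
    # code_interpreter outputs (including images) live in the tool_calls
    print(step.step_details)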
I see a few problems:
1) when you convert OpenAI messages to a LlamaIndex ChatMessage, you drop messages if they aren't MessageContentText. So you drop MessageContentImageFile, for example (a possible fix is sketched further down):
def from_openai_thread_message(thread_message: Any) -> ChatMessage:
    """From OpenAI thread message."""
    from openai.types.beta.threads import MessageContentText, ThreadMessage

    thread_message = cast(ThreadMessage, thread_message)
    # we don't have a way of showing images, just do text for now
    text_contents = [
        t for t in thread_message.content if isinstance(t, MessageContentText)
    ]
    text_content_str = " ".join([t.text.value for t in text_contents])
2) smaller problem: the _chat() function only returns the latest message, and the chat response is a string only, while to properly support the Assistants API you would want to return something like a list of messages.
latest_message = self.latest_message
# get most recent message content
return AgentChatResponse(
    response=str(latest_message.content),
    sources=metadata["sources"],
)
The OpenAI message call actually returns all the messages properly, including file metadata:
raw_messages = self._client.beta.threads.messages.list(
    thread_id=self._thread_id, order="desc"
)
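And once you have a file ID from one of those messages, pulling the actual image bytes looks straightforward (sketch; the exact files method name depends on the SDK version):

file_id = "file-abc123"  # e.g. from a MessageContentImageFile entry
image = self._client.files.content(file_id)
with open(f"{file_id}.png", "wb") as f:
    f.write(image.read())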
So I guess I know how to do a PR to fix my scenario, but I don't know why the choices were made the way they were and if changing things would break anything. Can ChatMessage be extended to support OpenAI's MessageContentImageFile?
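Here's roughly how I'd fix (1): keep the image parts and stash their file IDs in additional_kwargs instead of dropping them. Untested sketch; the ChatMessage import path varies by version, and the field names are from the openai SDK I have installed:

from typing import Any, cast

from llama_index.llms.types import ChatMessage, MessageRole  # path varies by version


def from_openai_thread_message(thread_message: Any) -> ChatMessage:
    """From OpenAI thread message, keeping image content instead of dropping it."""
    from openai.types.beta.threads import (
        MessageContentImageFile,
        MessageContentText,
        ThreadMessage,
    )

    thread_message = cast(ThreadMessage, thread_message)
    text_contents = [
        t for t in thread_message.content if isinstance(t, MessageContentText)
    ]
    image_contents = [
        c for c in thread_message.content if isinstance(c, MessageContentImageFile)
    ]
    text_content_str = " ".join([t.text.value for t in text_contents])

    return ChatMessage(
        role=MessageRole(thread_message.role),
        content=text_content_str,
        additional_kwargs={
            # assumption: stash image file IDs so callers can fetch the bytes later
            "image_file_ids": [c.image_file.file_id for c in image_contents],
        },
    )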
(note: I just made a small edit to clarify my description of problem #1 above)
I think the OpenAIAssistant
class should be marked as beta; it was written in about a day to support their release 😅 I don't see too many people using it, which is probably why this has gone unaddressed until now
The ChatMessage
object has an additional_kwargs
field that can probably be leveraged here (and I see it's already being used to fill in some ID values)
I think another option is putting the image into the sources, since we don't technically have a clear abstraction yet to support chat messages that are images
Either option would probably work though
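i.e. something like this for the sources option (just a sketch; image_file_ids is a hypothetical list collected from the run, and I'm assuming the images come from code_interpreter):

from llama_index.tools.types import ToolOutput  # import path varies by version

image_sources = [
    ToolOutput(
        content=f"image file: {file_id}",
        tool_name="code_interpreter",
        raw_input={},
        raw_output={"file_id": file_id},
    )
    for file_id in image_file_ids
]
# these would get appended to the sources returned in AgentChatResponse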
@Logan M I think you're right that ChatMessage's additional_kwargs can go a long way to solve this. The full OpenAI message is already in there anyway.
Here's how I think you could do this:
1) add a message_type in ChatMessage to distinguish a normal text message from other messages. (One alternative: an empty content string could indicate it's not a text message, and the user has to look in additional_kwargs to parse the message)
2) stop filtering out non-text messages in from_openai_thread_message() -- this could have more implications, so it could also be gated behind an optional parameter
3) create a new wrapper around run_assistant() that returns a list of ChatMessages
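For (3), I'm imagining something like this (sketch only; it reuses the client/thread attributes the agent already has):

def chat_messages_from_thread(self) -> list[ChatMessage]:
    """Return every message on the thread as a ChatMessage, oldest first."""
    raw_messages = self._client.beta.threads.messages.list(
        thread_id=self._thread_id, order="asc"
    )
    return [from_openai_thread_message(m) for m in raw_messages.data]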
I think that makes sense to me.
A little hesitant to change things related to the ChatMessage
object since it's fairly low-level and used in a lot of places, but I suppose some testing would ensure that it doesn't break anything haha
I know how to fix my use case now. But for fixing this in the library, it'd be good to get some guidance from people on the team who have a vision for how this should work. Maybe you're that person?
Also, if we want to support similar semantics to the existing chat() method, we'd have to deal with AgentChatResponse. That only holds a single string response, while it seems like it should contain a list of ChatMessages or something. Again, something that could have more implications (including for consumers of the EventPayload.RESPONSE event)
so i guess there are a few things to deal with to fix this properly
I think I disagree with that last point? You can always use agent.chat_history
to get chat messages
Since much of llama-index is based around text, handling images is still something we are figuring out for much of the library. Maybe this requires some new ImageChatMessage
object, but not sure.
In my mind, the image gets added as either a source, or as part of the message
I think even with text, it seems like an agent response should be a list of messages instead of a single message, since agents can have multiple steps
but those steps are things that inform the final response (i.e. a source)
you might be right - you're saying that in most cases you just want the final response, and if you DO want the other steps you go get the chat_history
i'm not that familiar with all the use cases of agents so that seems reasonable
Yea, that's what I'm getting at. I think the callbacks should also be exposing intermediate steps, for applications that want to display that information in real time (I'm aware our callback coverage is not great at the moment too lol)
so then the issue is just that chat_history filters out non-text responses, and that ChatMessage doesn't have semantics for holding a non-text response
Yes, it would be great if the library could "yield" or call a callback with messages for progress steps.
yeah i think you can be compatible with this in OpenAI
but it doesn't work that way now when using the assistants API, right?
only OpenAIAgent and ReActAgent