Checkpoint

Question

I'm currently developing an agent workflow with human in the loop interaction and function calling. The workflow works great if the user stays in the session to complete it. I've tried both context serialization and checkpoints in order to persist context state with no success. I save the context/checkpoint after each iteration and load it back when starting the workflow as suggested in the documentation. I think the problem is with tool calling. Right after loading the checkpoint and adding the new user input, the agent gets stuck "thinking" ... as if it didn't know what steps is next.

Logan M · Answer

Context serialization should work. I triple checked the examples in the docs to make sure they serialize nicely
https://docs.llamaindex.ai/en/stable/examples/agent/agent_workflow_basic/#human-in-the-loop

Logan M · Answer

If you can reproduce in a Google colab, that'd be most helpful tbh

Ariel · Answer

@Logan M Thanks for your reply. It's a bit tricky to create a google colab version because the interaction happens in realtime over websockets. The specific use case I'm debugging is the user disconnecting (navigating away from the page...) and returning later. The context gets serialized as you suggested. But when the user comes back later, establishes a new websocket connection, and starts interacting with the agent to reply to the latest message, after 2 iterations I get:

Plain Text

llama_index.core.workflow.errors.WorkflowRuntimeError: Error in step 'run_agent_step': Error code: 400 - {'error': {'message': "Invalid parameter: Duplicate value for 'tool_call_id' of 'call_GwlY8eGG80cPcXhEhw5OSMP0', in messages[6] and messages[7].", 'type': 'invalid_request_error', 'param': 'messages.[7].tool_call_id', 'code': None}}

Ariel · Answer

It's funny because once the user returns, the first iteration (question/answer) works fine. And then the error pops up. It seems to me that one of the messages got lost somehow in between runs and it's not being persisted. I've attached the sample code I'm using for testing the worklow with FastAPI (I left out the workflow definition...since that seems to run just fine)

Ariel · Answer

So after exhaustive debugging, it looks like it gets stuck right after restoring the context. The context dict itself looks fine, it has the workflow state and all messages. I added a comment in the code where it gets stuck (no errors).
I'm wondering whether my FastAPI setup is right. Am I suppose to handle all websockets interactions within the handler.stream_events() loop? Or is it fine to break from the loop and receive websocket messages through the main websockets endpoint? Because it seems to stop working once breaking from the loop and trying to reenter later...
I get the feeling my approach may be wrong: I'm trying to manage the case where a user closes the websocket connection, comes back later opens a new one so the entire thing is run from top to bottom.

Find answers from the community

Checkpoint