Querying a Multimodal LLM Continuously While Retaining Memory

Hello there, I am looking for a way to query a multimodal LLM continuously while retaining memory, much like simulating a user giving ChatGPT an image and a first prompt, then follow-up prompts asking questions about the image. Is there a cookbook for this in LlamaIndex?
Thanks. I am also wondering how I can make use of an agent. Sometimes I pass in an image that is weirdly oriented; I'm thinking of using an agent with tools to help me with that.

does this cookbook answer my question?

https://docs.llamaindex.ai/en/stable/examples/multi_modal/mm_agent/
You may not find your exact use case in the docs, but there are tons of examples you can refer to.
There are lots of Agent examples, AgenticWorkflows, Workflows, and lots more!
I have tried Workflow, but it feels conflicting because I can also do if/else to control the overall pipeline. But let's see!
hey @WhiteFang_Jr how can I create an agent with tools out of this notebook?
You mean call a third-party service?
No, I mean something like this:
Python
def get_image_orientation(angle: int) -> int:
    """Useful for checking image orientation and returning the
    degree of orientation as an integer (either one of 0, 90, 180, -90)."""

    return angle

orientation_tool = FunctionTool.from_defaults(
    fn=get_image_orientation,
)

tools = [orientation_tool]

mm_llm = OpenAI(
    model=llm_model_small,
    temperature=TEMPERATURE,
    api_key=openai_api_key,
    logprobs=None,
    default_headers={},
)

MESSAGE = """You are a helpful agent. You will be given an image. Your task
is to determine the orientation of the image. Return either 0, 90, 180, or -90."""

for img_b64 in base64_images:
    image_document = Document(image_resource=MediaResource(data=img_b64))

    # The image goes into the agent's prefix message so the LLM sees it
    # on every turn.
    msg = ChatMessage(
        role=MessageRole.USER,
        blocks=[
            TextBlock(text=MESSAGE),
            ImageBlock(image=image_document.image_resource.data),
        ],
    )

    agent = OpenAIAgent.from_tools(
        tools=tools,
        llm=mm_llm,
        verbose=False,
        prefix_messages=[msg],
    )

    query = """Use your vision capabilities and from the image content,
    determine the orientation of this image"""

    response = agent.chat(message=query)
    print(str(response))
I got it to work, but I could not get the orientation of the document right 😂
I think you should use GPT-4o; its vision capabilities can help, maybe.
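One other thing worth noting: `get_image_orientation` in the snippet above just echoes whatever angle the model passes in, so calling it never changes the image. A tool that actually rotates the base64 image and hands the result back for re-inspection might get further. A sketch with Pillow; the function name and flow are my own assumptions, not from any LlamaIndex cookbook:

```python
# Sketch: a tool the agent can call to physically rotate a base64 image,
# instead of just echoing an angle back. Assumes Pillow is installed;
# rotate_image_b64 is an illustrative name.
import base64
import io

from PIL import Image


def rotate_image_b64(img_b64: str, angle: int) -> str:
    """Rotate a base64-encoded image by `angle` degrees counter-clockwise
    and return the rotated image, base64-encoded again."""
    img = Image.open(io.BytesIO(base64.b64decode(img_b64)))
    rotated = img.rotate(angle, expand=True)  # expand=True keeps the full frame
    buf = io.BytesIO()
    rotated.save(buf, format=img.format or "PNG")
    return base64.b64encode(buf.getvalue()).decode()
```

Wrapped with `FunctionTool.from_defaults(fn=rotate_image_b64)`, this could sit alongside the orientation prompt so the agent can correct a sideways scan and then re-read it.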