Querying a Multimodal LLM Continuously While Retaining Memory

Hello there, I am looking for a way to query a multimodal LLM continuously while retaining memory, much like simulating a user giving ChatGPT an image and a first prompt, then follow-up prompts asking questions about the image. Is there a cookbook for this in LlamaIndex?
Thanks. I am also wondering how I can make use of an agent. Sometimes I pass in an image that is weirdly oriented; I'm thinking of using an agent with tools to help me with that.

does this cookbook answer my question?

https://docs.llamaindex.ai/en/stable/examples/multi_modal/mm_agent/
You may not find your exact use case in the docs, but there are tons of examples you can refer to.
There are lots of Agent examples, AgenticWorkflows, Workflows, and lots more!
I have tried Workflow, but it feels conflicting because I can also do if/else to control the overall pipeline. But let's see!
hey @WhiteFang_Jr how can I create an agent with tools out of this notebook?
You mean call a third-party service?
No, I mean something like this:
Python
def get_image_orientation(angle: int) -> int:
    """Useful for checking image orientation and returning the
    degree of orientation as an integer (either one of 0, 90, 180, -90)."""

    return angle

orientation_tool = FunctionTool.from_defaults(
    fn=get_image_orientation,
)

tools = [orientation_tool]

mm_llm = OpenAI(
    model=llm_model_small,
    temperature=TEMPERATURE,
    api_key=openai_api_key,
    logprobs=None,
    default_headers={},
)

MESSAGE = """You are a helpful agent. You will be given an image. Your task
is to determine the orientation of the image. Return either 0, 90, 180, or -90."""

for img_b64 in base64_images:
    image_document = Document(image_resource=MediaResource(data=img_b64))

    # The image goes into the agent's prefix message so the LLM sees it
    # on every turn.
    msg = ChatMessage(
        role=MessageRole.USER,
        blocks=[
            TextBlock(text=MESSAGE),
            ImageBlock(image=image_document.image_resource.data),
        ],
    )

    agent = OpenAIAgent.from_tools(
        tools=tools,
        llm=mm_llm,
        verbose=False,
        prefix_messages=[msg],
    )

    query = """Use your vision capabilities and from the image content,
    determine the orientation of this image"""

    response = agent.chat(message=query)
    print(str(response))
I got it to work, but I could not get the orientation of the document right 😂
I think you should use GPT-4o; its vision capabilities can help, maybe.
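One other thing worth noting: `get_image_orientation` in the snippet above just echoes whatever angle the model passes in, so calling it never changes the image. A tool that actually rotates the base64 image and hands the result back for re-inspection might get further. A sketch with Pillow; the function name and flow are my own assumptions, not from any LlamaIndex cookbook:

```python
# Sketch: a tool the agent can call to physically rotate a base64 image,
# instead of just echoing an angle back. Assumes Pillow is installed;
# rotate_image_b64 is an illustrative name.
import base64
import io

from PIL import Image


def rotate_image_b64(img_b64: str, angle: int) -> str:
    """Rotate a base64-encoded image by `angle` degrees counter-clockwise
    and return the rotated image, base64-encoded again."""
    img = Image.open(io.BytesIO(base64.b64decode(img_b64)))
    rotated = img.rotate(angle, expand=True)  # expand=True keeps the full frame
    buf = io.BytesIO()
    rotated.save(buf, format=img.format or "PNG")
    return base64.b64encode(buf.getvalue()).decode()
```

Wrapped with `FunctionTool.from_defaults(fn=rotate_image_b64)`, this could sit alongside the orientation prompt so the agent can correct a sideways scan and then re-read it.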