Which agent are you using? I think you can do something like this:
```python
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls

image_urls = [
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
]

image_documents = load_image_urls(image_urls)
```
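As far as I know, `load_image_urls` just wraps each URL in an `ImageDocument`, so if your images are local files you can probably build the documents yourself instead (the file path here is just a placeholder):

```python
from llama_index.core.schema import ImageDocument

# Assuming ImageDocument accepts a local path via image_path;
# "./my_image.jpg" is a hypothetical file, swap in your own
image_documents = [ImageDocument(image_path="./my_image.jpg")]
```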
Then you can pass them to a multimodal model:
```python
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", max_new_tokens=300
)

complete_response = openai_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)
```
There's a full example in the docs: https://docs.llamaindex.ai/en/stable/examples/multi_modal/openai_multi_modal.html