It sounds like it can be handled by GPT-4V? You can create `ImageDocument` or `ImageNode` objects that point to the `image_path` where you've saved the image. These nodes can also optionally include text.
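Roughly, something like the sketch below. The helper name `make_image_document` and the `caption` argument are made up for illustration, and the import path is an assumption that depends on which llama-index version you have installed, so check it against your install.

```python
from pathlib import Path


def make_image_document(image_path, caption=None):
    """Build an ImageDocument that points at an image saved on disk.

    `make_image_document` and `caption` are hypothetical names used only
    for this sketch; they are not part of the library.
    """
    path = Path(image_path)
    # Fail early if the image was never saved where we expect it.
    if not path.is_file():
        raise FileNotFoundError(f"No image found at {path}")

    # Assumption: the module path differs between releases, so try the
    # newer layout first and fall back to the older one.
    try:
        from llama_index.core.schema import ImageDocument
    except ImportError:
        from llama_index.schema import ImageDocument

    # The text is optional; an empty string means "image only".
    return ImageDocument(image_path=str(path), text=caption or "")


# Example usage (assumes llama-index is installed and the file exists):
# doc = make_image_document("chart.png", caption="Quarterly revenue chart")
```

`ImageNode` takes the same `image_path` field if you'd rather build nodes directly instead of documents.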
Our multi-modal support is still a bit of a work in progress, but there's lots of info here:
https://docs.llamaindex.ai/en/stable/use_cases/multimodal.html