I'm using the multi_modal_pydantic notebook for pydantic + GPT-4 Vision extraction

At a glance

The community member is using the multi_modal_pydantic notebook for pydantic + GPT-4 Vision extraction and is adapting it to their own use case. They ask whether the provided code should load multiple images from the restaurant_images directory and process them in sequence.

In the comments, one community member notes that image_documents will be a list of all the images, which are only loaded into memory when needed. Another adds that they had to write extra code to loop over the images in the OpenAI call so that every image in the restaurant_images directory would be processed.

There is no explicitly marked answer in the post or comments.

I'm using the multi_modal_pydantic notebook for pydantic + GPT-4 Vision extraction and trying to adapt it for my use case.

Shouldn't the code below load multiple images, so that every image in restaurant_images can be processed in sequence?

Plain Text
from llama_index.multi_modal_llms import OpenAIMultiModal
from llama_index import SimpleDirectoryReader

# put your local directory here
image_documents = SimpleDirectoryReader("./restaurant_images").load_data()

# OPENAI_API_TOKEN is assumed to be defined earlier, as in the notebook
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=1000
)
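For reference, OpenAIMultiModal can also accept the whole list of image documents in a single completion call. A minimal sketch (the prompt string here is illustrative, not from the notebook):

Plain Text
# Minimal sketch: one completion call over every loaded image.
# The prompt text is illustrative, not from the notebook.
response = openai_mm_llm.complete(
    prompt="Describe what is shown in these restaurant images.",
    image_documents=image_documents,
)
print(response)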
3 comments
I'm not sure what you mean? image_documents will be a list of all the images, but they are only loaded into memory when they are needed.
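A quick way to check this is to inspect the list the reader returns. A minimal sketch, reusing image_documents from the snippet above:

Plain Text
# Minimal sketch: inspect the list returned by SimpleDirectoryReader.load_data().
# Reuses image_documents from the question's snippet.
print(len(image_documents))      # one document per file in ./restaurant_images
print(type(image_documents[0]))  # image files load as ImageDocument objects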
I had to add some code to tell it to loop over multiple images in the OpenAI call.

I.e., if there were 10 images in restaurant_images, I wanted it to process all of them.
Plain Text
# Imports used by this snippet (as in the multi_modal_pydantic notebook)
from llama_index.program import MultiModalLLMCompletionProgram
from llama_index.output_parsers import PydanticOutputParser

# ins_image_documents is the list returned by SimpleDirectoryReader(...).load_data()
# Iterate over each image document and process it individually
for index, image_document in enumerate(ins_image_documents):
    print(f"--- Image {index + 1} ---")  # Add an identifier for each image

    # Create a new program instance for each image
    openai_program_ins = MultiModalLLMCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(Menu),
        image_documents=[image_document],  # Process only the current image
        prompt_template_str=prompt_template_str,
        multi_modal_llm=openai_mm_llm,
        verbose=True,
    )

    # Run the extraction and print each (field, value) pair of the Menu result
    response = openai_program_ins()
    for res in response:
        print(res)

    print("\n")  # Extra newline for better separation between images