I'm using the multi_modal_pydantic notebook for Pydantic + GPT-4 Vision extraction and trying to adapt it to my use case.

Shouldn't the code below allow for multiple images to be loaded in so any images in restaurant_images can be called in sequence?:

Plain Text
from llama_index.multi_modal_llms import OpenAIMultiModal
from llama_index import SimpleDirectoryReader

# put your local directory here
image_documents = SimpleDirectoryReader("./restaurant_images").load_data()

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=1000
)
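
A minimal sketch of the completion step that would follow (the prompt string here is illustrative, not from the thread, and this assumes the same legacy llama_index 0.9.x API as the snippet above):

Plain Text
# Loading the directory alone doesn't invoke the model; a completion call does.
# gpt-4-vision-preview accepts several images in one call, but it isn't
# guaranteed to describe each one unless the prompt asks for that.
response = openai_mm_llm.complete(
    prompt="Describe each of the menu images in order.",  # illustrative prompt
    image_documents=image_documents,
)
print(response)
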
I'm not sure what you mean? image_documents will be a list of all the images, but they are only loaded into memory when they are needed.
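
To illustrate that point, a minimal sketch (assuming the legacy llama_index 0.9.x API used in this thread, where SimpleDirectoryReader wraps each image file in an ImageDocument):

Plain Text
image_documents = SimpleDirectoryReader("./restaurant_images").load_data()
print(len(image_documents))       # one ImageDocument per image file
for doc in image_documents:
    print(doc.image_path)         # the path is recorded up front
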
I had to add some code to loop over the images and make a separate OpenAI call for each one, i.e. if there were 10 images in restaurant_images I wanted it to process all of them.
Plain Text
from llama_index.program import MultiModalLLMCompletionProgram
from llama_index.output_parsers import PydanticOutputParser

# Iterate over each image document and process it individually
for index, image_document in enumerate(image_documents):
    print(f"--- Image {index + 1} ---")  # Add an identifier for each image

    # Create a new program instance for each image
    openai_program_ins = MultiModalLLMCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(Menu),  # Menu: the Pydantic schema to extract
        image_documents=[image_document],  # Process only the current image
        prompt_template_str=prompt_template_str,
        multi_modal_llm=openai_mm_llm,
        verbose=True,
    )

    # Run the program; iterating the parsed Menu yields its (field, value) pairs
    response = openai_program_ins()
    for res in response:
        print(res)

    print("\n")  # Extra newline for better separation between images