Imagez

Plain Text
import os

import openai
from dotenv import load_dotenv
from llama_index.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.query_engine import SimpleMultiModalQueryEngine
from llama_index.vector_stores import QdrantVectorStore
from qdrant_client import QdrantClient

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
openai.api_key = OPENAI_API_KEY

# local Qdrant instance (REST API, default port 6333)
client = QdrantClient(url="http://localhost")

# GPT-4V as the multi-modal LLM
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview",
    api_key=os.getenv("OPENAI_API_KEY"),
    max_new_tokens=1500,
)
# separate Qdrant collections for text and image embeddings
vector_store = QdrantVectorStore(
    "global_text_store",
    client=client,
)
image_store = QdrantVectorStore(
    "global_image_store",
    client=client,
)

# rebuild the multi-modal index on top of the existing collections
index = MultiModalVectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    image_vector_store=image_store,
    use_async=False,
    show_progress=True,
)
# retrieval works: both text and image nodes come back
retriever = index.as_retriever()
image_nodes = retriever.retrieve("Find images in the knowledgebase.")
print("Image Nodes: ", image_nodes)

# note: SimpleMultiModalQueryEngine takes the LLM via the
# `multi_modal_llm` keyword; an unrecognized keyword such as
# `openai_mm_llm` is silently swallowed by **kwargs
query_engine = SimpleMultiModalQueryEngine(
    retriever=retriever,
    multi_modal_llm=openai_mm_llm,
)

response_1 = query_engine.query(
    "Describe the images in your knowledgebase as if you were a blind person.",
)
print("Response: ", response_1)


This works for the retriever: it retrieves both text and image nodes. It does NOT work for the query, though.
13 comments
If you check response.source_nodes, you can see the nodes it used to make the response.
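
For example, a quick way to see whether any ImageNodes actually made it into the response (a minimal sketch, reusing the `query_engine` from the snippet above):

Plain Text
from llama_index.schema import ImageNode

response = query_engine.query("Describe the images in your knowledgebase.")
for node_with_score in response.source_nodes:
    node = node_with_score.node
    kind = "image" if isinstance(node, ImageNode) else "text"
    print(kind, node_with_score.score, node.node_id)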

You might have to prompt engineer a bit, tbh, to get it to pay attention to the image properly.
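
One way to do that prompt engineering is to pass a custom QA template; a sketch, assuming the llama_index 0.9.x `text_qa_template` parameter and the `{context_str}`/`{query_str}` variables it expects:

Plain Text
from llama_index.prompts import PromptTemplate

qa_tmpl = PromptTemplate(
    "The images retrieved from the knowledgebase are attached.\n"
    "Context from the text store:\n{context_str}\n"
    "Query: {query_str}\n"
    "Answer by describing the attached images in detail.\n"
)
query_engine = SimpleMultiModalQueryEngine(
    retriever=index.as_retriever(),
    multi_modal_llm=openai_mm_llm,
    text_qa_template=qa_tmpl,
)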
So I have 10 images, and the only thing the text store says is what I sent you before,
"This is a test text node", something like that.
The images are all unique, but even if I tell it to just summarize the images, it still won't help me.
Are the images in the response.source_nodes?
Plain Text
Response:  I'm sorry, but I cannot provide a description for the image as there seems to be a misunderstanding. There is no image attached to your query for me to describe. If you have an image you would like me to describe, please provide it, and I will do my best to give you a detailed description.
Response Source Nodes: [
    NodeWithScore(
        node=TextNode(id_='a6f532dc-6d82-4ab3-9622-36467d7870cc',
                      text='This is a test text', ...),
        score=0.75210965),
    NodeWithScore(
        node=ImageNode(id_='ada19de9-730c-47f0-89bf-fa4158e31f8c',
                       metadata={'user_id': '1234567890'},
                       image='/9j/4AAQSkZJRgABAQAAAQABAAD/...'),
        ...)]
But if my query is literally "Describe the images in your knowledgebase as if you were a blind person", isn't that... enough?
I have no idea. I know when I tried with your sample repo before, it was complaining about the image not being related to the text or query lol
Hm. But I'm just asking it to utilize the images.
If it won't even use them, then what's the point of all this? How do I force it to use them? Do I need to make my own response synthesizer?
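
One way to force the issue, without writing a custom response synthesizer, is to skip the query engine and hand the retrieved images straight to the multi-modal LLM. A rough sketch, assuming llama_index 0.9.x and the `retriever`/`openai_mm_llm` objects from the snippet above:

Plain Text
from llama_index.schema import ImageDocument, ImageNode

# keep only the image nodes from the retrieval results
results = retriever.retrieve("Find images in the knowledgebase.")
image_docs = [
    ImageDocument(image=r.node.image)  # base64 payload stored on the node
    for r in results
    if isinstance(r.node, ImageNode)
]

# call GPT-4V directly with the images attached
answer = openai_mm_llm.complete(
    prompt="Describe these images as if you were explaining them to a blind person.",
    image_documents=image_docs,
)
print(answer.text)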
When I give ChatGPT an image and I say, "Please describe what's in the image", it handles it fine. So I guess I need to AI-generate the metadata tags using the image document.
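
Captioning each image with the model and storing the caption as searchable text could look roughly like this; `images/photo_01.jpg` is a hypothetical path, and `insert_nodes` is assumed to be inherited from the base vector index:

Plain Text
from llama_index.schema import ImageDocument, TextNode

img_path = "images/photo_01.jpg"  # hypothetical local image

# ask GPT-4V to caption the image
caption = openai_mm_llm.complete(
    prompt="Write a short, literal caption for this image.",
    image_documents=[ImageDocument(image_path=img_path)],
).text

# store the caption as a text node that points back at the image
caption_node = TextNode(text=caption, metadata={"image_path": img_path})
index.insert_nodes([caption_node])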