Hello, how is everyone?

Can someone tell me if I can also embed the images from my PDFs and query about those images? It doesn't seem to be working for me, and all the images in the PDF are ignored.
Could you also help me find the right way to embed or process the images? Is there any other technique, library, or anything else that could be handy?
Yeah, I don't think the PDF loaders load images, except for the flat PDF loader on LlamaHub

But even then, there's no embedding model for images right now, so OCR or image captioning is applied instead
You might have to write your own loader to do this the way you want
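For example, here's a rough sketch of pulling the embedded images out of a PDF with PyMuPDF (the fitz package) so you can OCR or caption them yourself; the file name is just a placeholder:

Python
import fitz  # PyMuPDF: pip install pymupdf

# Placeholder path; point this at your own PDF
pdf = fitz.open("my_document.pdf")

for page in pdf:
    # get_images(full=True) lists every image XObject the page references
    for img in page.get_images(full=True):
        xref = img[0]
        pix = fitz.Pixmap(pdf, xref)
        # Convert CMYK (and other non-RGB) images to RGB before saving
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"page{page.number}_img{xref}.png")

Each saved image can then go through OCR or a captioning model, and you index the resulting text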
Right now, no one is really offering joint image-text embedding models, except maybe Google?
No idea how to use them, though
How can I train my model to understand the images that I am embedding? Any hint or documentation reference?
There are some LLMs that understand images as input. From my understanding, it's basically fusing a transformer and a vision transformer. These models require a lot of GPU power to run

But that's about the limit of my knowledge on this topic lol
What you are looking for is a "MultiModal LLM"
LlamaIndex offers easy integrations with some, like BLIP from Salesforce, that can understand images at least to some degree:

Python
from dotenv import load_dotenv
load_dotenv()

from pathlib import Path
from llama_index.readers.base import BaseReader
from typing import Dict, List, Optional
from dataclasses import dataclass

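# Lightweight stand-ins for llama_index's Document / ImageDocument types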
@dataclass
class Document:
    text: str
    metadata: Dict

@dataclass
class ImageDocument(Document):
    image: str

class ImageCaptionReader(BaseReader):

    def __init__(
        self,
        parser_config: Optional[Dict] = None,
        keep_image: bool = False,
        prompt: Optional[str] = None,
    ):
        if parser_config is None:
            try:
                import sentencepiece
                import torch
                from PIL import Image
                from transformers import BlipForConditionalGeneration, BlipProcessor
            except ImportError:
                raise ImportError(
                    "Please install extra dependencies that are required for "
                    "the ImageCaptionReader: "
                    "`pip install torch transformers sentencepiece Pillow`"
                )

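            # Prefer GPU with half precision when available; fall back to CPU and float32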
            device = "cuda" if torch.cuda.is_available() else "cpu"
            dtype = torch.float16 if torch.cuda.is_available() else torch.float32

            processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
            model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=dtype)

            parser_config = {
                "processor": processor,
                "model": model,
                "device": device,
                "dtype": dtype,
            }

        self._parser_config = parser_config
        self._keep_image = keep_image
        self._prompt = prompt

    def load_data(
        self, 
        file: Path, 
        extra_info: Optional[Dict] = None
    ) -> List[Document]:

        from PIL import Image
        from llama_index.img_utils import img_2_b64

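        # Load the image and normalize to RGB for the BLIP processor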
        image = Image.open(file)
        if image.mode != "RGB":
            image = image.convert("RGB")

        image_str: Optional[str] = None
        if self._keep_image:
            image_str = img_2_b64(image)

        model = self._parser_config["model"]
        processor = self._parser_config["processor"]
        device = self._parser_config["device"]
        dtype = self._parser_config["dtype"]
        model.to(device)

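        # Generate a caption, optionally conditioned on a text prompt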
        inputs = processor(image, self._prompt, return_tensors="pt").to(device, dtype)
        out = model.generate(**inputs)
        text_str = processor.decode(out[0], skip_special_tokens=True)

        return [
            ImageDocument(
                text=text_str,
                image=image_str,
                metadata=extra_info or {},
            )
        ]

# Alternatively, skip the class above and pull the prebuilt loader from LlamaHub:
#   from llama_index import download_loader
#   ImageCaptionReader = download_loader("ImageCaptionReader")
loader = ImageCaptionReader()
documents = loader.load_data(file=Path("image.png"))

for document in documents:
    # load_data returns ImageDocument dataclasses; read the caption off .text
    print("Text:", document.text)
Yeah, something like a multimodal LLM.
I think you'll need to wait for GPT-4V to get a true multimodal LLM
LOL yes. Maybe, or maybe not.
@Teemu, do you have any idea what would be the best way to train my model or embed the images into my model?
If you want actual image embeddings, you'll probably have to use the Google Vertex AI API: https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-multimodal-embeddings?_ga=2.77296658.-643597094.1691443757

But you can also use the image-captioning method
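A rough sketch of what the Vertex AI call looks like, assuming a recent google-cloud-aiplatform SDK and a GCP project with Vertex AI enabled (project and location are placeholders):

Python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="my-project", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
embeddings = model.get_embeddings(
    image=Image.load_from_file("image.png"),
    contextual_text="optional text to embed alongside the image",
)

# Image and text vectors share one embedding space, so they can be compared directly
print(len(embeddings.image_embedding), len(embeddings.text_embedding))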
Thanks @Teemu, I will check