Hello How is everyone

Ash_

Hello, How is everyone?

Can someone tell me whether I can also embed the images from my PDFs and query them? It doesn't seem to be working for me: all the images in the PDF are ignored.
Could you also help me find the right way to embed or process the images? Is there any other technique or library that could be handy?

14 comments

Logan M

Yeah, I don't think the PDF loaders load images, except for the flat PDF loader on llama-hub.

But even then, there's no embedding model for images right now. So instead, OCR or image captioning is applied.

Logan M

You might have to write your own loader to do this the way you want
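To make the "write your own loader" idea a bit more concrete, here is a rough sketch, not an official LlamaIndex loader. It assumes PyMuPDF (`pip install pymupdf`) for pulling the embedded images out of the PDF; the function names `extract_pdf_images` and `images_to_records` and the `describe` callback are made up for illustration. `describe` stands in for whatever turns image bytes into text (OCR, a captioning model, etc.).

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


def extract_pdf_images(pdf_path: Path) -> List[Tuple[int, bytes]]:
    """Return (page_number, raw_image_bytes) for every image embedded in the PDF."""
    # Assumption: PyMuPDF is installed. Imported lazily so the rest of this
    # sketch can be used (and tested) without it.
    import fitz

    images: List[Tuple[int, bytes]] = []
    with fitz.open(str(pdf_path)) as doc:
        for page_number, page in enumerate(doc, start=1):
            for image_info in page.get_images(full=True):
                xref = image_info[0]  # first item is the image's cross-reference id
                images.append((page_number, doc.extract_image(xref)["image"]))
    return images


def images_to_records(
    images: Iterable[Tuple[int, bytes]],
    describe: Callable[[bytes], str],
    source: str,
) -> List[Dict]:
    """Turn images into text records that can be indexed like any other document."""
    records: List[Dict] = []
    for index, (page_number, image_bytes) in enumerate(images):
        text = describe(image_bytes).strip()
        if not text:
            continue  # nothing useful came back for this image, so skip it
        records.append(
            {
                "text": text,
                "metadata": {"source": source, "page": page_number, "image_index": index},
            }
        )
    return records
```

Something like `images_to_records(extract_pdf_images(Path("report.pdf")), describe=my_caption_fn, source="report.pdf")` would then give you text records (here `report.pdf` and `my_caption_fn` are placeholders) that you can wrap in documents and index alongside the PDF text.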

Logan M

Right now, no one is really offering joint image-text embedding models, except maybe Google?

Logan M

No idea how to use them, though.

Ash_

How can I train my model to understand the images that I am embedding? Any hints or documentation references?

Logan M

There are some LLMs that understand images as input. From my understanding, it's basically fusing a text transformer and a vision transformer. These models require a lot of GPU power to run.

But that's about the limit of my knowledge on this topic lol

Logan M

What you are looking for is a "MultiModal LLM"

Teemu

LlamaIndex offers easy integrations with some models, like BLIP from Salesforce, that can understand images at least to some degree:

Plain Text

from dotenv import load_dotenv
load_dotenv()

from pathlib import Path
from llama_index import download_loader
from llama_index.readers.base import BaseReader
from typing import Dict, List, Optional
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    metadata: Dict

@dataclass
class ImageDocument(Document):
    image: Optional[str] = None  # base64 string, only set when keep_image=True

class ImageCaptionReader(BaseReader):

    def __init__(
        self,
        parser_config: Optional[Dict] = None,
        keep_image: bool = False,
        prompt: Optional[str] = None,
    ):
        if parser_config is None:
            try:
                import sentencepiece
                import torch
                from PIL import Image
                from transformers import BlipForConditionalGeneration, BlipProcessor
            except ImportError:
                raise ImportError(
                    "Please install extra dependencies that are required for "
                    "the ImageCaptionReader: "
                    "`pip install torch transformers sentencepiece Pillow`"
                )

            device = "cuda" if torch.cuda.is_available() else "cpu"
            dtype = torch.float16 if torch.cuda.is_available() else torch.float32

            processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
            model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=dtype)

            parser_config = {
                "processor": processor,
                "model": model,
                "device": device,
                "dtype": dtype,
            }

        self._parser_config = parser_config
        self._keep_image = keep_image
        self._prompt = prompt

    def load_data(
        self, 
        file: Path, 
        extra_info: Optional[Dict] = None
    ) -> List[Document]:

        from PIL import Image
        from llama_index.img_utils import img_2_b64

        image = Image.open(file)
        if image.mode != "RGB":
            image = image.convert("RGB")

        image_str: Optional[str] = None
        if self._keep_image:
            image_str = img_2_b64(image)

        model = self._parser_config["model"]
        processor = self._parser_config["processor"]
        device = self._parser_config["device"]
        dtype = self._parser_config["dtype"]
        model.to(device)

        inputs = processor(image, self._prompt, return_tensors="pt").to(device, dtype)
        out = model.generate(**inputs)
        text_str = processor.decode(out[0], skip_special_tokens=True)

        return [
            ImageDocument(
                text=text_str,
                image=image_str,
                metadata=extra_info or {},
            )
        ]

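# Use the ImageCaptionReader class defined above. Alternatively, skip the class
# definition and fetch it with: ImageCaptionReader = download_loader("ImageCaptionReader")
loader = ImageCaptionReader()
documents = loader.load_data(file=Path('image.png'))

for document in documents:
    # Each document carries the generated caption in its `text` field
    print("Text:", document.text)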

Ash_

Yeah, something like a MultiModal LLM.

Teemu

I think you'll need to wait for gpt-v to get a true multimodal LLM

Ash_

LOL yes. Maybe, or maybe not.

Ash_

@Teemu, do you have any idea what would be the best way to train my model or embed the images into my model?

Teemu

If you want actual image embeddings you'll probably have to use Google Vertex API: https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-multimodal-embeddings?_ga=2.77296658.-643597094.1691443757

But you can also do the image captions method
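For what it's worth, here is roughly how querying works once you have "actual image embeddings": the text query is embedded into the same vector space as the images, and the images are ranked by cosine similarity. This is only a toy sketch; the vectors and file names below are invented, and in practice they would come from whatever multimodal embedding model you end up using. `cosine_similarity` and `rank_images` are illustrative helper names, not part of any library.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(
    query_vector: Sequence[float],
    image_vectors: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (image_name, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in image_vectors.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Invented 3-dimensional vectors; real embeddings have far more dimensions.
image_vectors = {
    "page3_chart.png": [0.9, 0.1, 0.0],
    "page7_photo.png": [0.1, 0.8, 0.3],
}
query_vector = [1.0, 0.0, 0.0]  # stand-in for the embedded text query

best_match = rank_images(query_vector, image_vectors, top_k=1)[0]
print("Best match:", best_match[0])
```

With the captions method the retrieval step is the same, except that the vectors come from a regular text embedding of each caption instead of from the image itself.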

Ash_

Thanks @Teemu, I will check.
