
At a glance

The community member asks whether LlamaIndex has a solution for managing the size of a page sent to the OpenAI image model, i.e. for balancing resolution against token usage. The comments note that LlamaIndex has added new features for handling multimodal data, including a new multimodal node and improved support for chat messages, and that the image detail level (low/high/auto) can be set on the OpenAI multimodal integration. The community member mentions that their PDFs are scanned physical documents, essentially photos of document pages, which are challenging to work with. The recommended approach is to use LlamaParse (or a similar tool) to OCR each page and send both the text and the image to the language model. A link to an example notebook demonstrating this approach is provided.

Does LlamaIndex have a solution for managing the size of a page sent to the OpenAI image model?
i.e. to manage resolution vs. tokens?
I think for resolution you can just set low/high/auto for the image detail:
https://github.com/run-llama/llama_index/blob/af9abd06a456a3745d02379f8afc4b6cab3a3f72/llama-index-integrations/multi_modal_llms/llama-index-multi-modal-llms-openai/llama_index/multi_modal_llms/openai/base.py#L60

I haven't checked OpenAI's exact API to see whether they've added more controls than that recently.
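
For reference, a minimal sketch of that setting, assuming the `image_detail` field at the linked line accepts `"low"`, `"high"`, or `"auto"`; the model name and image path below are placeholders:

```python
# Sketch: trading image resolution against token cost via image_detail.
# Assumes the OpenAIMultiModal integration exposes an `image_detail` field
# mirroring OpenAI's per-image `detail` option ("low", "high", "auto").
from llama_index.core.schema import ImageDocument
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# "low" has the API downscale the image and charge a small fixed token cost;
# "high" tiles it for full resolution at a higher cost; "auto" lets it choose.
llm = OpenAIMultiModal(model="gpt-4o", image_detail="low")

response = llm.complete(
    prompt="Transcribe the text on this scanned page.",
    image_documents=[ImageDocument(image_path="page_001.png")],  # placeholder path
)
print(response.text)
```

With `image_detail="low"` the downscaling happens on OpenAI's side, so there is no need to resize the page image before sending it.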
Will take a look. Glad to see you're in the room.

Thank you.
My PDFs are all scanned physical documents, so the pages are basically photos of document pages.
They are a challenge to work with.
Typically the best approach we've seen is to use LlamaParse (or something else) to OCR the page, and then send both the text and the image to the LLM.

We have examples doing that 😁
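
A hedged sketch of that text-plus-image pattern (not necessarily the exact example referenced above), assuming LlamaParse returns per-page OCR text from `load_data` and that page images have been exported separately; the file paths and question are placeholders:

```python
# Sketch: OCR a scanned page with LlamaParse, then send both the extracted
# text and the page image to the multimodal LLM so it can cross-check the
# transcription against the visual layout.
# Requires LLAMA_CLOUD_API_KEY (for LlamaParse) and OPENAI_API_KEY in the env.
from llama_parse import LlamaParse
from llama_index.core.schema import ImageDocument
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# OCR the scanned PDF; LlamaParse handles image-only pages.
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("scanned.pdf")  # placeholder path

llm = OpenAIMultiModal(model="gpt-4o", image_detail="auto")

prompt = (
    "Here is OCR text extracted from the attached page image:\n\n"
    f"{documents[0].text}\n\n"
    "Using both the text and the image, what is the document's date?"
)
response = llm.complete(
    prompt=prompt,
    # Page image exported separately, e.g. with pdf2image (placeholder path).
    image_documents=[ImageDocument(image_path="page_001.png")],
)
print(response.text)
```

Pairing the image with the OCR text lets the model recover from OCR mistakes and use layout cues that plain text loses.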