
At a glance

The community member asks whether LlamaIndex has a solution for managing the size of a page sent to the OpenAI image model, i.e. for balancing resolution against token usage. The comments note that LlamaIndex has added new features for handling multimodal data, including a new multimodal node and improved support for chat messages, and that the image detail level (low/high/auto) can be set on the OpenAI multimodal integration. The community member mentions that their PDFs are scanned physical documents, essentially photos of document pages, which are challenging to work with. The recommended approach is to use LlamaParse (or a similar tool) to OCR each page and send both the text and the image to the language model. A link to an example notebook demonstrating this approach is provided.

Does LlamaIndex have a solution for managing the size of a page sent to the OpenAI image model?
i.e. to manage resolution vs. tokens?
I think for resolution you can just set low/high/auto for the image detail:
https://github.com/run-llama/llama_index/blob/af9abd06a456a3745d02379f8afc4b6cab3a3f72/llama-index-integrations/multi_modal_llms/llama-index-multi-modal-llms-openai/llama_index/multi_modal_llms/openai/base.py#L60

I haven't checked OpenAI's exact API to see whether they've added more controls than that recently.
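
For reference, a minimal sketch of that setting, assuming the `image_detail` field at the linked line accepts `"low"`, `"high"`, or `"auto"`; the model name and image path below are placeholders:

```python
# Sketch: trading image resolution against token cost via image_detail.
# Assumes the OpenAIMultiModal integration exposes an `image_detail` field
# mirroring OpenAI's per-image `detail` option ("low", "high", "auto").
from llama_index.core.schema import ImageDocument
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# "low" has the API downscale the image and charge a small fixed token cost;
# "high" tiles it for full resolution at a higher cost; "auto" lets it choose.
llm = OpenAIMultiModal(model="gpt-4o", image_detail="low")

response = llm.complete(
    prompt="Transcribe the text on this scanned page.",
    image_documents=[ImageDocument(image_path="page_001.png")],  # placeholder path
)
print(response.text)
```

With `image_detail="low"` the downscaling happens on OpenAI's side, so there is no need to resize the page image before sending it.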
Will take a look. Glad to see you're in the room.

Thank you.
My PDFs are all scanned physical documents, so the pages are basically photos of document pages.
They are a challenge to work with.
Typically the best approach we've seen is to use LlamaParse (or something else) to OCR the page, and then send both the text and the image to the LLM.

We have examples doing that 😁
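
A hedged sketch of that text-plus-image pattern (not necessarily the exact example referenced above), assuming LlamaParse returns per-page OCR text from `load_data` and that page images have been exported separately; the file paths and question are placeholders:

```python
# Sketch: OCR a scanned page with LlamaParse, then send both the extracted
# text and the page image to the multimodal LLM so it can cross-check the
# transcription against the visual layout.
# Requires LLAMA_CLOUD_API_KEY (for LlamaParse) and OPENAI_API_KEY in the env.
from llama_parse import LlamaParse
from llama_index.core.schema import ImageDocument
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# OCR the scanned PDF; LlamaParse handles image-only pages.
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("scanned.pdf")  # placeholder path

llm = OpenAIMultiModal(model="gpt-4o", image_detail="auto")

prompt = (
    "Here is OCR text extracted from the attached page image:\n\n"
    f"{documents[0].text}\n\n"
    "Using both the text and the image, what is the document's date?"
)
response = llm.complete(
    prompt=prompt,
    # Page image exported separately, e.g. with pdf2image (placeholder path).
    image_documents=[ImageDocument(image_path="page_001.png")],
)
print(response.text)
```

Pairing the image with the OCR text lets the model recover from OCR mistakes and use layout cues that plain text loses.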