Community members are discussing which vision model performs as well as GPT-4, an LLM with a vision modality. Suggestions include Llama 3.1/3.2 and its multimodal version, as well as closed-source models like Sonnet and Gemini. One community member mentions Qwen but is unsure about its multimodal capabilities. There is also a discussion about the reliability of academic benchmarks, with one community member expressing skepticism. Another shares their experience with OpenAI's vision model, noting that it performs well on a document containing many numbers. Finally, a community member expresses interest in a similar model that can run locally.