Sully
Joined September 25, 2024
I was trying to build a web app like this for myself. They use Vertex multimodal embeddings for this demo, and Google actually posted it to showcase how multimodal Vertex embeddings work:

https://ai-demos.dev/

But in this example (the screenshot), they were also able to match the "timberlands" part of my query (even with a typo), not just the image part (black and green shoes).

My question is: how were they able to match the "timberlands" part? With text-to-image Vertex multimodal embeddings, the app would just map the text query to the image of the product, not also to the product-name text.

I just want to know how they did it, because it's been confusing me for a while and I want to build something similar.
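My current guess (just an assumption, not anything from the demo's docs): they embed the product *title* text as well as the product image, and blend both similarities at query time, so a query like "timberlands" can score against the title even when the image similarity is middling. Here's a toy sketch of that idea — the vectors are made up and stand in for real Vertex multimodal embeddings, and the blend weight `alpha` is something I invented:

```python
# Hypothetical hybrid search sketch: blend image-similarity and
# title-similarity. Toy 3-d vectors stand in for real embeddings.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In a real app these would come from the Vertex multimodal embedding
# model (one embedding for the product image, one for the title text).
catalog = {
    "timberland-boots": {"image": np.array([0.1, 0.9, 0.1]),
                         "title": np.array([0.9, 0.1, 0.1])},
    "green-sneakers":   {"image": np.array([0.2, 0.8, 0.3]),
                         "title": np.array([0.1, 0.2, 0.9])},
}

def search(query_vec, alpha=0.5):
    # alpha weights image vs. title similarity (a made-up knob).
    scored = {
        name: alpha * cosine(query_vec, p["image"])
              + (1 - alpha) * cosine(query_vec, p["title"])
        for name, p in catalog.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

# A query vector close to the boots' *title* embedding still retrieves
# the boots first, even though its image similarity is only middling.
print(search(np.array([0.9, 0.2, 0.1])))
```

Is that roughly what the demo is doing, or do they handle the product name some other way?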