I'm trying to build a web app like this for myself. The demo uses Vertex multimodal embeddings, and Google actually published it to showcase how Vertex multimodal embeddings work:
https://ai-demos.dev/

But in this example (the screenshot), the search also matched the "timberlands" part of my query (even with a typo), not just the visual part (black and green shoes).
My question is: how were they able to match the "timberlands" part? With text-to-image Vertex multimodal embeddings, the app would just map the text query to the product image embedding, not also to the product-name text.
I just want to know how they did it, because it's been confusing me for a while and I want to build something similar.
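For reference, this is the kind of text-to-image setup I mean. It's only a minimal sketch, assuming the Vertex AI Python SDK's MultiModalEmbeddingModel; the project ID, product IDs, and file paths are placeholders. Only the product images get embedded and indexed here, which is why I don't see where the product-name text ("timberlands") could come into play:

```python
import numpy as np
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Placeholder project/location
vertexai.init(project="my-project", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# Index time: embed each product IMAGE into the shared text/image space.
# The product name is only used as an ID here, never embedded.
product_images = {
    "shoe_001": "images/shoe_001.jpg",
    "shoe_002": "images/shoe_002.jpg",
}
index = {}
for product_id, path in product_images.items():
    emb = model.get_embeddings(image=Image.load_from_file(path))
    index[product_id] = np.array(emb.image_embedding)

# Query time: embed the raw text query and rank products by cosine
# similarity against the image embeddings only.
query = "black and green timberlands"
query_vec = np.array(model.get_embeddings(contextual_text=query).text_embedding)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

results = sorted(
    ((cosine(query_vec, vec), pid) for pid, vec in index.items()),
    reverse=True,
)
print(results)
```

With a setup like this, the query only ever gets compared to what's visible in the image, so I don't understand how the demo picked up on the brand/product-name text as well.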