I was trying to build a web app like this for myself. The demo uses Vertex multimodal embeddings, and Google actually posted it to showcase how Vertex multimodal embeddings work:

https://ai-demos.dev/

But in this example (the screenshot)

They were also able to get the timberlands part in the search (even with a typo) - not just the image part (black and green shoes) of my query

My question is how were they able to get the timberlands part - because with text-to-image Vertex multimodal embeddings, the app would just map the text query to the image of the product (not also to the product name text)

I just wanna know how they did it because it's been confusing me for a while and I wanna build something similar
Attachment: image.png
Maybe they embed the images and their captions. Or maybe their text-to-image embedding model was trained on pairs of Timberland shoe images and text

Like, from the embedded text, the embeddings can technically know what that text looks like - that's kind of how text-to-image works, right?
that's the thing - they use Google Vertex multimodal embeddings and don't use anything else. This is a demo from Google
- So they embed the image

- They also embed the title of the product, it seems

- Somehow they account for both of these embeddings (Vertex multimodal embedding can embed both image and text)

- So when they pass a query, it looks at the image embedding of the product + the text embedding of the product name

- That's where I'm lost: how do they store both of those embeddings at once and account for both of them when they pass in a text query?
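(For reference, this is roughly what a single Vertex multimodal embedding call gives you - a minimal sketch assuming the current vertexai Python SDK; the project, file name, and product title are just placeholders:)

Python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Placeholder project/location - assumes Vertex AI is already set up.
vertexai.init(project="your-project", location="us-central1")

# One call embeds both the product image and its title into the same
# vector space (1408 dims by default), so the two vectors are comparable.
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
emb = model.get_embeddings(
    image=Image.load_from_file("product_123.png"),
    contextual_text="Timberland black and green boots",
)

image_vector = emb.image_embedding  # list of floats for the picture
title_vector = emb.text_embedding   # list of floats for the title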
you can store both embeddings in separate vector indexes, retrieve from both, and then do some deduplicating and ranking
interesting hmm
I think I can do that with pinecone lowkey
also since the multimodal embedding is like CLIP, the dimensions are the same for text and image, so maybe we can store them both in the same index? 🤔 how would the retrieving work tho?

Do vector databases allow you to do that?
the fact that you said deduplicating and ranking tells me that finally someone actually gets what I'm tryna do 😆
but I'm still a bit lost about how that would work lowkey
Yea since we are using a single embedding model, it could be the same index. (sometimes you might use a different model for pure text)

Since you would have multiple embeddings representing the same object, you need to do some fusion algorithm

Stuff like reciprocal rank fusion, relative score (distance-based) fusion, etc.

This is a good starter article for this kind of thing
https://weaviate.io/blog/hybrid-search-fusion-algorithms
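One way the "same index" idea could look in practice (a sketch only, not the demo's actual code - this assumes the Pinecone v3 Python client, reuses the image_vector / title_vector / model names from the Vertex snippet above, and the index name and metadata fields are made up): store two vectors per product, tag each with the product_id and whether it came from the image or the title, then collapse the query hits by product_id.

Python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("products")  # hypothetical 1408-dim index

# Two vectors per product, both pointing back at the same product_id.
index.upsert(vectors=[
    {"id": "prod-123-image", "values": image_vector,
     "metadata": {"product_id": "prod-123", "kind": "image"}},
    {"id": "prod-123-title", "values": title_vector,
     "metadata": {"product_id": "prod-123", "kind": "title"}},
])

# Embed the user's query with the same model, search once over everything,
# then keep only the best-scoring hit per product (the dedup step).
query_vector = model.get_embeddings(contextual_text="timberlands black green").text_embedding
res = index.query(vector=query_vector, top_k=20, include_metadata=True)

best = {}
for m in res.matches:
    pid = m.metadata["product_id"]
    best[pid] = max(best.get(pid, 0.0), m.score)
ranked_products = sorted(best, key=best.get, reverse=True)

Taking the max score per product is the simplest possible "fusion"; a rank-based version is sketched further down.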
interesting, yup I'll look into that (does Pinecone have something like that built in?)

I know in Pinecone I tried hybrid search (a bit different than fusion or using 2 embeddings)

basically we use dense vectors (the image embeddings) and sparse vectors (the text part)
The issue with this is that we don't use vector embeddings for the text and instead use sparse encoders like SPLADE and BM25 (keyword search) - so it doesn't account for typos or the meaning of the text
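(For comparison, a hybrid query in Pinecone looks roughly like this - a sketch assuming a dotproduct index plus the index and query_vector names from the snippet above; the sparse indices/values are placeholders a BM25/SPLADE-style encoder would normally produce from the query text, which is why it behaves like keyword matching:)

Python
# Hybrid query: a dense vector (the embedding) plus a sparse vector from a
# keyword-style encoder such as BM25 or SPLADE. The indices/values below are
# placeholders - a real encoder would derive them from the query text.
res = index.query(
    vector=query_vector,
    sparse_vector={"indices": [1023, 4021], "values": [0.8, 0.3]},
    top_k=10,
    include_metadata=True,
)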
so you think Google, in their demo here

https://ai-demos.dev/

uses some sort of fusion algorithm? (wish I could see their source code for this demo lol)
Not built in, fusion is something you can apply directly when combining sets of retrieved nodes that might be pointing to the same object/source.

Yea, sparse vectors are something else - they're not technically used here
yh interesting, any good tutorial on it for Pinecone?
I read this article https://weaviate.io/blog/hybrid-search-fusion-algorithms and it talked about

sparse and dense embeddings, which is not what I'm looking for because it's not what's implemented in the demo I shared
like from a theoretical perspective, how would it work?
like let's say we have an image embedding (product image) and a text embedding (product name)

how do we go about storing and retrieving them
the way the demo does, so that the result is a single product
I'm sure this can be done with Pinecone (I prefer it because it's prod ready) - also Pinecone supports sparse vectors as well, so I assume this could be done with it
I don't think sparse vectors are needed here, but the fusion algorithms it talks about can be applied to any collection of nodes

Plain Text
retrieved_nodes1 = ..  # e.g. hits from the image-embedding query
retrieved_nodes2 = ..  # e.g. hits from the title-embedding query

<apply fusion algorithm to merge them into one ranked list>
interesting, how would it work tho? also I assume a node is the object returned when we perform a search (not a term used much in the Pinecone docs lol)

so what would the fusion algo be if we retrieve the text embedding of the product name and the image embedding of the product image?
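One common answer is reciprocal rank fusion (the "ranked fusion" variant from that Weaviate article): query the index once for each kind of embedding, then score every product by where it ranks in each result list. A minimal sketch, reusing the index / query_vector names and metadata layout from the earlier Pinecone snippet (all assumptions, not the demo's actual code):

Python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of product ids into one ranking.

    Each list contributes 1 / (k + rank) per product, so a product that
    ranks high in both the image results and the title results wins.
    """
    scores = {}
    for hits in result_lists:
        for rank, product_id in enumerate(hits, start=1):
            scores[product_id] = scores.get(product_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Query the same index twice, once per vector "kind", with the same query vector.
image_hits = [m.metadata["product_id"] for m in
              index.query(vector=query_vector, top_k=20, include_metadata=True,
                          filter={"kind": "image"}).matches]
title_hits = [m.metadata["product_id"] for m in
              index.query(vector=query_vector, top_k=20, include_metadata=True,
                          filter={"kind": "title"}).matches]

ranked_products = reciprocal_rank_fusion([image_hits, title_hits])

The deduplication is just the scores.get(...) accumulation - two hits for the same product collapse into one entry.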