
Updated 7 months ago

I am working on finetuning embeddings

I am working on finetuning embeddings for my RAG pipeline and I have followed the LlamaIndex finetuning guide to finetune "intfloat/multilingual-e5-large-instruct". I have e-commerce product data (product descriptions), and the descriptions contain numeric data as well. For example, there can be a size (100 cm, 100 mm, 100 m), a price in numeric format, and other numeric attributes. According to my experiments, embedding models are not good with numeric data: if the search query is "XYZ with 50 cm long and 1000$ price", it gives me products related to XYZ and ignores the 50 cm and the 1000$ price.

That's why I used GPT-4 to generate synthetic search queries for each product and finetuned a model on them, but the finetuned model is worse than the baseline model. In the synthetic queries, I made sure to include the numeric data for each product, but it seems like this is not working. What am I doing wrong?
52 comments
How exactly did you make the synthetic queries? I recently built an e-Commerce product data pipeline for a client but instead of using semantic search for numerical values, I used Qdrant filters: https://docs.llamaindex.ai/en/stable/examples/vector_stores/Qdrant_metadata_filter/

If you use the range filters for example, you'll be able to restrict the search very accurately. If you need to construct the filters automatically from natural language, you can use this: https://docs.llamaindex.ai/en/stable/examples/vector_stores/chroma_auto_retriever/
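To make the range-filter idea concrete, here is a minimal plain-Python sketch of what such a filter does before any semantic ranking happens. It does not use the Qdrant or LlamaIndex APIs; the `Product` schema, the field names `length_cm` and `price_eur`, and the sample catalog are all hypothetical, and it assumes the numeric attributes already exist as metadata.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]


@dataclass
class Product:
    # Hypothetical schema: numeric attributes stored as metadata next to the text.
    text: str
    length_cm: float
    price_eur: float


def in_range(value: float, low: Optional[float], high: Optional[float]) -> bool:
    """True if value lies in [low, high]; None leaves that bound open."""
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


def prefilter(products: List[Product],
              length_cm: Bounds = (None, None),
              price_eur: Bounds = (None, None)) -> List[Product]:
    """Keep only products whose metadata satisfies every numeric range."""
    return [
        p for p in products
        if in_range(p.length_cm, *length_cm) and in_range(p.price_eur, *price_eur)
    ]


catalog = [
    Product("XYZ shelf, oak", length_cm=50, price_eur=899),
    Product("XYZ shelf, steel", length_cm=50, price_eur=1200),
    Product("XYZ shelf, pine", length_cm=120, price_eur=450),
]

# "XYZ with 50 cm long and price less than 1000 euros"
# (bounds are inclusive here; a strict "less than" would need its own check)
candidates = prefilter(catalog, length_cm=(50, 50), price_eur=(None, 1000))
print([p.text for p in candidates])
```

The embedding search would then only need to rank the surviving candidates for "XYZ", so the numbers never have to be understood by the embedding model.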
I have used GPT4 with this custom prompt as a system prompt:

Given the context information and no prior knowledge, generate only search queries in English and German for e-commerce products. Assume expertise in crafting user search queries tailored to online shopping platforms. Your primary task is to establish two diverse search queries that a user might employ to locate these products. These queries should be varied and accurately reflect typical search behaviors. Incorporate any given attributes into the queries. Always produce search queries based on the provided context, regardless of the product type.

And passed one product at a time to generate search queries and mostly search queries are like this;

  1. Fast cement in cream color 10 kg
  2. Dark green poster paint Oecoplan 350 ml
  3. Sigma Velocomputer BC 7.16 Torx drive 2 year warranty
One question about metadata filters in Qdrant: I only have text for each product (no filters are defined in my data), so how can I use metadata filters? To the best of my understanding, in order to use metadata filters you need to have predefined metadata on the nodes when creating the index.
But I also feel like the generated synthetic dataset isn't good because the queries are way too specific, for example "Fluorocarbon fishing line Momoi Hi-Catch Neo 0.2 mm 50m". I am sure no user will type a search query like this to find a product, but the sad part is that our QA team is testing with exactly these types of queries. They also want to find products with queries like "XYZ with 50 cm long and price less than 1000 euros", and it's nearly impossible to find the relevant products with semantic similarity alone, without metadata or SQL search.
The approach mentioned by Teemu is really reliable and you should try it.
Another solution that might work here would be text to sql
Hey @Roland Tannous, thanks for the reply. The data itself is the problem. The only thing I have about the products is the "text" description. So to go with filters or text-to-SQL, I will first have to extract all the possible attributes/entities and then either use filters or store the extracted attributes in SQL and do text-to-SQL. I have more than 100K products, so this solution will be a bit expensive.
have you tried an LLM to extract?
you could also use traditional NLU techniques like regex or slot filling .. or a combination..
shouldn't be expensive
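As a rough illustration of the regex route, here is a minimal sketch that pulls a length and a price out of a free-text description using only the standard library. The patterns, the unit table, and the output keys `length_mm` and `price` are assumptions for illustration; real product text will need more units, thousands separators, and per-category rules.

```python
import re

# Unit -> factor to millimetres (illustrative subset, integer factors).
_TO_MM = {"mm": 1, "cm": 10, "m": 1000}

_NUM = r"\d+(?:[.,]\d+)?"
# The trailing \b stops "350 ml" from being read as metres.
_LENGTH_RE = re.compile(rf"({_NUM})\s*(mm|cm|m)\b", re.IGNORECASE)
# Currency either before the number ("$50") or after it ("50$", "1000 euros").
_PRICE_RE = re.compile(
    rf"[$€]\s*({_NUM})|({_NUM})\s*(?:[$€]|euros?|eur|usd)(?!\w)",
    re.IGNORECASE,
)


def _to_float(raw: str) -> float:
    # Treats a comma as a decimal separator; thousands separators are not handled.
    return float(raw.replace(",", "."))


def extract_attributes(text: str) -> dict:
    """Return the first length (in mm) and the first price found in text."""
    attrs = {}
    length = _LENGTH_RE.search(text)
    if length:
        attrs["length_mm"] = _to_float(length.group(1)) * _TO_MM[length.group(2).lower()]
    price = _PRICE_RE.search(text)
    if price:
        attrs["price"] = _to_float(price.group(1) or price.group(2))
    return attrs


print(extract_attributes("red t-shirt with size 100cm at 50$"))
print(extract_attributes("XYZ with 50 cm long and 1000$ price"))
```

Running something like this over 100K descriptions is cheap compared with an LLM call per product, and an LLM could be reserved for the descriptions where the regex finds nothing.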
there might be another way
are you familiar with the concept of using multiple vectors to embed one item?
what vector db are you using?
actually in your case multimodal is less relevant. I thought you had pictures too.
But the idea is this:
use a multivalue query operator with weights. There is an algorithm for it called "Weak AND/Weighted AND"
but to make things easier for you, marqo, the vector db, has this implemented 🙂
so try marqo, there is a free community version and try the weighted terms in your search
the example they give is:
mq.index("my-first-index").search(
    {
        "red t-shirt": 1.0,
        "short sleeve": 0.3,
        "buttons": -0.4,
        "low resolution, blurry, jpeg artifacts": -0.2,
    }
)
this tells the query engine the weights of the different search terms.
to make your data work for this, you might need to do some data manipulation. nothing that can't be automated.
Let's say you have a product description: "red t-shirt with size 100cm at 50$"
you might need to turn it to :
red t-shirt size-100 price-50
then when a user searches:
red t-shirt with a size 100 cm and price 50 , you can use either regex or llm to actually turn it into a json with "red t-shirt size-100 price-50" like this:
{
"features": ["red t-shirt", "size-100", "price-50"]
}
and pass it to the marqo query engine with weights, example:
mq.index("my-first-index").search(
    {
        "red t-shirt": 1.0,
        "size-100": 1,
        "price-50": 1,

    }
)
just conceptualizing here, but this algorithm should work.
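The normalisation step described above could be sketched roughly like this. It only handles the two phrasings used in this thread ("size 100cm at 50$" and "size 100 cm and price 50"); the regexes and the helper names `normalise` and `weighted_query` are hypothetical, and the resulting dict is only meant to be passed as the query to the marqo search call shown earlier, which is not executed here.

```python
import re
from typing import Dict, List

# Deliberately narrow patterns covering only the phrasings from the thread.
_SIZE_RE = re.compile(r"(?:with\s+)?(?:a\s+)?size\s+(\d+)\s*cm\b", re.IGNORECASE)
_PRICE_RE = re.compile(r"(?:and\s+|at\s+)?(?:price\s+(\d+)|(\d+)\s*\$)", re.IGNORECASE)


def normalise(text: str) -> List[str]:
    """Split free text into a product phrase plus 'size-N' / 'price-N' tokens."""
    tokens = []
    for label, pattern in (("size", _SIZE_RE), ("price", _PRICE_RE)):
        match = pattern.search(text)
        if match:
            value = next(g for g in match.groups() if g is not None)
            tokens.append(f"{label}-{value}")
            # Cut the matched span out so only the product phrase remains.
            text = text[:match.start()] + " " + text[match.end():]
    phrase = " ".join(text.split())
    return ([phrase] if phrase else []) + tokens


def weighted_query(text: str, weight: float = 1.0) -> Dict[str, float]:
    """Build a {term: weight} dict in the shape used by the search example."""
    return {term: weight for term in normalise(text)}


# Index side: the description becomes "red t-shirt size-100 price-50".
print(" ".join(normalise("red t-shirt with size 100cm at 50$")))
# Query side: the user query becomes the weighted-term dict.
print(weighted_query("red t-shirt with a size 100 cm and price 50"))
```

Because the same function is applied to descriptions and queries, "100cm" and "100 cm" end up as the identical token `size-100`. Note that this only helps with exact values; a query like "price less than 1000" would still need a filter.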
Thanks @Roland Tannous for sending the resources. I will definitely check out multi-term queries. I did try an LLM, but there are lots of products with no category at all, so which attributes should I extract when there can be so many? I can use a Pydantic extraction chain, but for some products attribute A is important, for others B, and for some both A and B. And sometimes there are inconsistencies in the extracted attributes as well. For example, a brand can come out as "Marcedez" or "Marcedez Benz", and I have found so many inconsistencies like this in the extracted data that an SQL search query would be unreliable.
well you're lucky because
marqo implements multi-vectors (it uses vespa as a backend)
and one of the collateral effects is that it can still find the right hits even if there are differences in spelling or even spelling mistakes
I actually discovered marqo as i was trying to solve an issue with Name spelling mistakes in AI chats.. other embedding model+ vector DB combinations would get totally lost and fail to find the right hits
That sounds good, I am going through their docs right now. If it can find the right hit with different spellings then it will be really useful. Thanks 🙂
let me show you a visual quickly
most vector db , query engines, save embeddings as a vector
marqo saves them as "a tensor" (albeit i am not a fan of the term. They could have used another name :P)
that's why even with noise (spelling mistakes, etc) , it can still find the right hits
the name : "Philo of Bizantium"
using Elastic search + OpenAI embeddings, if someone made a spelling mistake and said "Philo of Byzantium" <-- y instead of i
in the query : "what book did Philo of Byzantium write"?
RAG would fail to find the right hit!
with marqo and all-MiniLM-L6-v2 , which is like a very old embedding model
I butchered the name on purpose with 4-5 spelling mistakes
writing Philo of pantium instead of Philo of Byzantium in the query
and it still found it
Vector embedding tensors are your friend. While they're usually used for multidimensional modalities like images, videos... they seem to have this positive side effect on text too
let me know how your experiment goes. gotta go.
Thanks Roland for all the help, that sounds promising and I am currently exploring the marqo docs. Will keep you posted on the results of my experiments.