I am working on finetuning embeddings

I am working on finetuning embeddings for my RAG pipeline, and I followed the LlamaIndex finetuning guide to finetune "intfloat/multilingual-e5-large-instruct". I have e-commerce product data (product descriptions), and the descriptions contain numeric data as well. For example, there can be a size (100 cm, 100 mm, 100 m), a price in numeric format, and other numeric data. According to my experiments, embedding models are not good with numeric data: if the search query is "XYZ with 50 cm long and 1000$ price", it returns products related to XYZ and ignores the 50 cm and the 1000$ price.

That's why I used GPT-4 to generate synthetic search queries for each product and finetuned a model on them, but the finetuned model is worse than the baseline model. I made sure the synthetic queries include the numeric data for each product, but this does not seem to be working. What am I doing wrong?

52 comments

TTeemu

How exactly did you make the synthetic queries? I recently built an e-commerce product data pipeline for a client, but instead of using semantic search for numerical values, I used Qdrant filters: https://docs.llamaindex.ai/en/stable/examples/vector_stores/Qdrant_metadata_filter/

If you use range filters, for example, you can restrict the search very accurately. If you need to construct the filters automatically from natural language, you can use this: https://docs.llamaindex.ai/en/stable/examples/vector_stores/chroma_auto_retriever/
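To make the range-filter idea concrete, here is a minimal sketch in plain Python. It is not the Qdrant or LlamaIndex API; the `Product` class, the `range_filter` helper, the metadata keys and the similarity scores are all hypothetical and only illustrate the order of operations: restrict by numeric range first, then rank what is left by semantic similarity.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    name: str
    score: float  # hypothetical similarity score from an embedding search
    metadata: dict = field(default_factory=dict)


def range_filter(products, key, gte=None, lte=None):
    """Keep only products whose metadata[key] lies within [gte, lte]."""
    kept = []
    for product in products:
        value = product.metadata.get(key)
        if value is None:
            continue  # products without the attribute cannot satisfy the filter
        if gte is not None and value < gte:
            continue
        if lte is not None and value > lte:
            continue
        kept.append(product)
    return kept


# Made-up example data, not taken from the thread.
products = [
    Product("shelf A", 0.91, {"length_cm": 50, "price": 1200}),
    Product("shelf B", 0.88, {"length_cm": 50, "price": 900}),
    Product("shelf C", 0.80, {"length_cm": 120, "price": 400}),
]

# "50 cm long and price at most 1000": filter first, then rank by similarity.
candidates = range_filter(products, "length_cm", gte=50, lte=50)
candidates = range_filter(candidates, "price", lte=1000)
ranked = sorted(candidates, key=lambda p: p.score, reverse=True)
```

In a real vector store the filter would be evaluated by the database alongside the vector search, so the embedding model never has to "understand" the numbers.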

HHK

I used GPT-4 with this custom system prompt:

Given the context information and no prior knowledge, generate only search queries in English and German for e-commerce products. Assume expertise in crafting user search queries tailored to online shopping platforms. Your primary task is to establish two diverse search queries that a user might employ to locate these products. These queries should be varied and accurately reflect typical search behaviors. Incorporate any given attributes into the queries. Always produce search queries based on the provided context, regardless of the product type.

I passed one product at a time to generate the search queries, and most of them look like this:

Fast cement in cream color 10 kg
Dark green poster paint Oecoplan 350 ml
Sigma Velocomputer BC 7.16 Torx drive 2 year warranty

One question about metadata filters with Qdrant: I have only text for each product (no metadata fields are defined in my data), so how can I use metadata filters? To the best of my understanding, to use metadata filters you need to have the metadata defined on the nodes when creating the index.
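One possible way to get metadata out of text-only products is to extract the numeric attributes before indexing. The sketch below is only an assumption about how that could look: the unit list, the regular expression and the `extract_measurements` helper are hypothetical, and real product descriptions would likely need more units and more careful handling.

```python
import re

# Hypothetical unit list; extend it to match the units in your own data.
MEASUREMENT_RE = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(kg|g|ml|l|mm|cm|m)\b",
    re.IGNORECASE,
)


def extract_measurements(text):
    """Return (value, unit) pairs found in a product description."""
    results = []
    for number, unit in MEASUREMENT_RE.findall(text):
        # Accept both "0.2" and the German-style "0,2" decimal separator.
        results.append((float(number.replace(",", ".")), unit.lower()))
    return results


cement = extract_measurements("Fast cement in cream color 10 kg")
fishing_line = extract_measurements(
    "Fluorocarbon fishing line Momoi Hi-Catch Neo 0.2 mm 50m"
)
```

The extracted pairs could then be stored as metadata on each node when the index is built, which is what the filters need.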

HHK

But I also feel like the generated synthetic dataset isn't good, because the queries are way too specific. Take this one, for example: "Fluorocarbon fishing line Momoi Hi-Catch Neo 0.2 mm 50m". I am sure no user will type a search query like this to find a product. The sad part is that our QA team is testing with this type of query. They also want to find products with queries like "XYZ with 50 cm long and price less than 1000 euros", and it's nearly impossible to find the relevant products with semantic similarity alone, without metadata or SQL search.
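For queries like the QA team's, one option is to split the query into numeric constraints and a remaining text part, so that only the text part goes to the embedding model. This is a rough sketch under stated assumptions: the regular expressions, the `parse_query` helper and the filter dictionary layout are hypothetical, and an LLM-based approach such as the auto-retriever linked above would be an alternative to hand-written patterns.

```python
import re

# Hypothetical patterns; they cover only the phrasings quoted in this thread.
LENGTH_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm|m)\b", re.IGNORECASE)
PRICE_RE = re.compile(
    r"(less than|under|below)?\s*(\d+(?:\.\d+)?)\s*(euros?|eur|€|\$)",
    re.IGNORECASE,
)


def parse_query(query):
    """Split a query into (text for semantic search, numeric filters)."""
    filters = {}

    length = LENGTH_RE.search(query)
    if length:
        filters["length"] = {
            "value": float(length.group(1)),
            "unit": length.group(2).lower(),
        }

    price = PRICE_RE.search(query)
    if price:
        # A comparison phrase means an upper bound; otherwise an exact price.
        op = "lt" if price.group(1) else "eq"
        filters["price"] = {"op": op, "value": float(price.group(2))}

    # Remove the numeric parts so only the descriptive text gets embedded.
    text = PRICE_RE.sub(" ", LENGTH_RE.sub(" ", query))
    return " ".join(text.split()), filters


text, filters = parse_query("XYZ with 50 cm long and price less than 1000 euros")
```

The resulting `filters` would be translated into the vector store's own filter objects, while `text` is embedded as usual.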
