I wanna finetune an embedding ("intfloat/multilingual-e...

At a glance

The community member wants to finetune an embedding model ("intfloat/multilingual-e5-large") for e-commerce products and asks what the training data should look like. The discussion concludes that when a query can match more than one product, all the relevant products should be included in the answer. The data should be formatted as triplets: the query, the relevant passages, and hard negatives. An example of such data for the multi-product case is provided.

HK

@kapa.ai I wanna finetune an embedding model ("intfloat/multilingual-e5-large") for e-commerce products and I am curious what the data should look like for training/finetuning. I read a few articles, and basically what everybody does is create question/answer pairs and use those to finetune the model. In my case, though, one query/question can match more than one product, so should I add all the possible products as answers for a query, and if yes, what should the data look like?


kapa.ai

Please stand by @HK as I search the knowledge base. I normally take about 20 seconds or so. :wave:

kapa.ai

----------------------
Yes, you're on the right track. In the context of LlamaIndex, finetuning an embedding model involves generating a synthetic question/answer dataset over any unstructured context, finetuning the model, and then evaluating it.

For your specific case where there can be more than one product for a query, you can indeed include all the relevant products in the answer. The data for both training and evaluation should be in the format of triplets, where every row should have the following:

query: This represents the question or target.
relevant_passages: This represents a list of documents or passages that contain information that answers the query. For every query, there must be at least one relevant passage.
hard_negatives: These are chunks or passages that don't contain an answer to the query. Providing at least ~5 hard negatives can lead to meaningful improvement. However, hard negatives are optional.
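To make the row format above concrete, here is a minimal sketch of one triplet row with the constraints from the list checked in code. The `TripletRow` class and its `validate` method are hypothetical names introduced for illustration, not part of LlamaIndex or any other library, and the passage strings are placeholder product texts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TripletRow:
    """One row: a query, its relevant passages, and optional hard negatives."""
    query: str
    relevant_passages: List[str]
    hard_negatives: List[str] = field(default_factory=list)  # optional

    def validate(self) -> None:
        # The query must carry some text.
        if not self.query.strip():
            raise ValueError("query must not be empty")
        # Every query needs at least one relevant passage.
        if not self.relevant_passages:
            raise ValueError("each query needs at least one relevant passage")
        # A passage cannot be both relevant and a hard negative.
        overlap = set(self.relevant_passages) & set(self.hard_negatives)
        if overlap:
            raise ValueError(f"passages marked both relevant and negative: {overlap}")


# A query with two relevant products and one hard negative (placeholder texts).
row = TripletRow(
    query="What are some good kitchen appliances?",
    relevant_passages=[
        "Instant Vortex 5.7QT Air Fryer Oven Combo. Kitchen Appliances.",
        "Product 2. Kitchen Appliances.",
    ],
    hard_negatives=["Non-Kitchen Product. Non-Kitchen Category."],
)
row.validate()  # raises ValueError if the row breaks a constraint
```

Validating rows before training is cheap and catches the most common labeling mistake here: the same product appearing as both a relevant passage and a hard negative for one query.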

Here's an example of how you can generate the question/answer pairs:

Plain Text

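from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

# train_nodes and val_nodes are assumed to be lists of nodes already parsed
# from your product documents (they are not defined in this snippet).
# Each call uses an LLM to generate questions for every node.
train_dataset = generate_qa_embedding_pairs(train_nodes)
val_dataset = generate_qa_embedding_pairs(val_nodes)

# Persist both datasets so they can be reloaded later,
# e.g. with EmbeddingQAFinetuneDataset.from_json("train_dataset.json").
train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")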


Remember, the goal is to create a dataset where each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
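For the multi-product case, one query simply contributes several datapoints, one per relevant product. The sketch below uses plain dictionaries to show that idea. The three-dictionary layout (`queries`, `corpus`, `relevant_docs`) is my understanding of how the LlamaIndex QA finetuning dataset is organized, so treat the field names as an assumption and check them against your installed version. The ids and product texts are made up for illustration.

```python
import json

# Hypothetical product texts keyed by product id.
corpus = {
    "p1": "Instant Vortex 5.7QT Air Fryer Oven Combo. Kitchen Appliances.",
    "p2": "Product 2. Kitchen Appliances.",
    "p3": "Non-Kitchen Product. Non-Kitchen Category.",
}

# Queries keyed by query id.
queries = {"q1": "What are some good kitchen appliances?"}

# One query can point at several relevant product ids.
relevant_docs = {"q1": ["p1", "p2"]}


def to_pairs(queries, corpus, relevant_docs):
    """Expand each query into one (query, passage) pair per relevant product."""
    pairs = []
    for query_id, doc_ids in relevant_docs.items():
        for doc_id in doc_ids:
            pairs.append((queries[query_id], corpus[doc_id]))
    return pairs


pairs = to_pairs(queries, corpus, relevant_docs)

# Serialize the three dictionaries together so the dataset can be stored as JSON.
dataset_json = json.dumps(
    {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs},
    indent=2,
)
```

Keeping products in a separate `corpus` keyed by id means a product that is relevant to many queries is stored once and referenced by id everywhere else.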
----------------------

HK

@kapa.ai can you please give me a few examples of what the data should look like when a query has multiple products?

kapa.ai

Let me check the knowledge base for you. One moment! :books:

kapa.ai

----------------------
Sure, here's an example of what your data might look like when multiple products are relevant to a single query. This is a hypothetical example and not taken directly from the provided sources, but it follows the structure suggested in the LlamaIndex documentation.

Plain Text

{
  "query": "What are some good kitchen appliances?",
  "relevant_passages": [
    {
      "title": "Instant Vortex 5.7QT Air Fryer Oven Combo",
      "category": "Kitchen Appliances",
      "discount": "20% off",
      "price": "$151.20",
      "rating": "4.7 out of 5 stars",
      "review": "5086 ratings",
      "description": "6-in-1 functionality; air fry, broil, bake, roast, reheat, and dehydrate. EvenCrisp Technology for crispy results. Easy to use touchscreen. Dishwasher safe parts. Cooks food faster and with less oil.",
      "inventory": "In stock"
    },
    {
      "title": "Product 2",
      "category": "Kitchen Appliances",
      "discount": "15% off",
      "price": "$200.00",
      "rating": "4.5 out of 5 stars",
      "review": "3000 ratings",
      "description": "Description for product 2.",
      "inventory": "In stock"
    }
  ],
  "hard_negatives": [
    {
      "title": "Non-Kitchen Product",
      "category": "Non-Kitchen Category",
      "discount": "10% off",
      "price": "$100.00",
      "rating": "4.0 out of 5 stars",
      "review": "2000 ratings",
      "description": "Description for a non-kitchen product.",
      "inventory": "In stock"
    }
  ]
}


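In this example, the query is "What are some good kitchen appliances?". The relevant_passages field contains two products that are relevant to this query. Each product is represented as a dictionary with fields like title, category, price, etc. The hard_negatives field contains products that are not relevant to the query. Note that a product from an unrelated category is an easy negative; a hard negative is usually one that looks similar to the relevant products but does not actually answer the query.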
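An embedding model consumes text, not dictionaries, so a row like the one above still has to be flattened before training. The sketch below is one possible way to do that, not a prescribed method: `product_to_text` and `row_to_triplets` are hypothetical helpers, and the choice to embed only title, category, and description (skipping volatile fields such as price, discount, and inventory) is an assumption you may want to revisit. The "query: " and "passage: " prefixes follow the input convention described on the E5 model card.

```python
from typing import Dict, List, Tuple


def product_to_text(product: Dict[str, str]) -> str:
    """Join the descriptive fields of a product into one passage string."""
    fields = ("title", "category", "description")  # assumed choice of fields
    return ". ".join(product[f].rstrip(".") for f in fields if product.get(f))


def row_to_triplets(row: dict) -> List[Tuple[str, str, str]]:
    """Expand one row into (query, positive, negative) text triplets.

    Every relevant product is paired with every hard negative. If the row
    has no hard negatives the result is empty, and you would fall back to
    (query, positive) pairs instead.
    """
    query = "query: " + row["query"]
    positives = ["passage: " + product_to_text(p) for p in row["relevant_passages"]]
    negatives = ["passage: " + product_to_text(p) for p in row.get("hard_negatives", [])]
    return [(query, pos, neg) for pos in positives for neg in negatives]


# A trimmed version of the example row, keeping only the fields used here.
example_row = {
    "query": "What are some good kitchen appliances?",
    "relevant_passages": [
        {"title": "Instant Vortex 5.7QT Air Fryer Oven Combo",
         "category": "Kitchen Appliances",
         "description": "6-in-1 functionality; air fry, broil, bake, roast, reheat, and dehydrate."},
        {"title": "Product 2",
         "category": "Kitchen Appliances",
         "description": "Description for product 2."},
    ],
    "hard_negatives": [
        {"title": "Non-Kitchen Product",
         "category": "Non-Kitchen Category",
         "description": "Description for a non-kitchen product."},
    ],
}

triplets = row_to_triplets(example_row)
```

With two relevant products and one hard negative, this row expands into two triplets, so a query with many matching products naturally contributes more training examples.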
----------------------
