Hello everyone, I am new to developing,

ssotiris

Hello everyone, I am new to developing, finetuning and training llms. So for the last week I am going through llamaIndex and openai docs. I would really want to discuss and hear more experienced opinions on my matter.

The problem I want to solve is a type of classification I think. We have a large database of products created by different stores. Each store can create a product and they are free to use whatever name they want. There are some products that could be duplicate across stores but have slightly or entirely different names that a human would be able to identify and match them. So for example a Coca Cola 500ml in one store could be named Coca Cola medium in another one.

In my opinion LLMs are the best way to recognize such variations. (I would like to hear why this might not be a good idea.)

And I am trying to figure out the best way to provide a model with my huge data collection (about 50k tokens based on the openai calculator). Instruct the model accordingly and guide it to return the entire data collection this time grouped. Basically to create one product prototype for each product of every store and list all possible variations below.

Any ideas on how I should proceed?

I already had some code experimentations that gives me some promising results but it is not entirely reliable since the data size is huge and are getting separated into smaller prompts and the classification process is not working as expected. I am thinking after each response to feed the result back to the model finetune it and in case a product matches a previous existing result to include it there.

I would love to give more details if someone is interested.

32 comments

ssotiris

Here are the system instructions I came up with. I am not a prompt engineer so sorry if this is not a good one.

ssotiris

it seems like some of the text can not be shown so to avoid downloading here is the rest of it

Plain Text

# This is incorrect because it groups different products like croissants, mini croissants with different fillings, biscuits, and brands like 7Days and Molto - all under one prototype.



Guidelines:

- Only return the final JSON object, no other text.
- Handle all variations: misspellings, shorthand, descriptors, sizes, promotions, etc.
- Separate distinct products into different prototypes based on your best judgment.
- Do not group different products together.
- Update prototypes if a new unique product is identified.
- DO NOT SKIP PRODUCTS EVEN IF THE NAME IS THE SAME OR IN PLURAL.
- Mimic realistic human naming variations, misspellings, and marketing tactics.
- Each brand can have many products BE AWARE of not putting same brand but different products in the same prototype.
- Determine if two product names refer to the same underlying product or different products based on the following criteria:
  - If the names differ only in spelling variations, shorthand, descriptors (e.g., size, flavor), promotions, or brand names, treat them as variations of the same product.
  - If the names include different core product terms (e.g., "Coca-Cola" vs. "Pepsi", "Chips" vs. "Biscuits"), treat them as different products and assign separate prototypes.
  - If unsure, err on the side of treating them as different products to avoid incorrect grouping.

Find answers from the community

Hello everyone, I am new to developing,