Hello everyone, I am new to developing, fine-tuning, and training LLMs. For the last week I have been going through the LlamaIndex and OpenAI docs, and I would really like to discuss my problem and hear some more experienced opinions.

The problem I want to solve is, I think, a type of classification. We have a large database of products created by different stores. Each store can create a product and is free to use whatever name it wants. Some products may be duplicates across stores but have slightly or entirely different names that a human would be able to identify and match. For example, a Coca Cola 500ml in one store could be named Coca Cola medium in another.

In my opinion LLMs are the best way to recognize such variations. (I would like to hear why this might not be a good idea.)

I am trying to figure out the best way to provide a model with my huge data collection (about 50k tokens based on the OpenAI calculator), instruct it accordingly, and guide it to return the entire collection grouped. Basically, I want it to create one product prototype for each product across every store and list all possible variations below it.

Any ideas on how I should proceed?

I have already done some code experiments that give me promising results, but they are not entirely reliable: the data is so large that it gets split into smaller prompts, and then the classification does not work as expected. I am thinking of feeding each response back to the model, fine-tuning it, and, if a product matches a previously returned result, including it there.

I would love to give more details if someone is interested.
Here are the system instructions I came up with. I am not a prompt engineer, so sorry if this is not a good one.
It seems like some of the text cannot be shown, so to avoid making you download it, here is the rest of it:

Plain Text
# This is incorrect because it groups different products like croissants, mini croissants with different fillings, biscuits, and brands like 7Days and Molto - all under one prototype.



Guidelines:

- Only return the final JSON object, no other text.
- Handle all variations: misspellings, shorthand, descriptors, sizes, promotions, etc.
- Separate distinct products into different prototypes based on your best judgment.
- Do not group different products together.
- Update prototypes if a new unique product is identified.
- DO NOT SKIP PRODUCTS EVEN IF THE NAME IS THE SAME OR IN PLURAL.
- Mimic realistic human naming variations, misspellings, and marketing tactics.
- Each brand can have many products; BE AWARE of not putting the same brand but different products in the same prototype.
- Determine if two product names refer to the same underlying product or different products based on the following criteria:
  - If the names differ only in spelling variations, shorthand, descriptors (e.g., size, flavor), promotions, or brand names, treat them as variations of the same product.
  - If the names include different core product terms (e.g., "Coca-Cola" vs. "Pepsi", "Chips" vs. "Biscuits"), treat them as different products and assign separate prototypes.
  - If unsure, err on the side of treating them as different products to avoid incorrect grouping.
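For context, here is a rough sketch of how I call the model with these instructions today (the model name, the JSON response format, and the `groupBatch` helper are placeholders, not my exact code):

Plain Text
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// placeholder helper: send one batch of product names along with the
// system instructions above and parse the grouped-JSON reply
async function groupBatch(systemInstructions: string, names: string[]) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo", // assumed model; any chat model should work
    response_format: { type: "json_object" }, // ask for strict JSON back
    messages: [
      { role: "system", content: systemInstructions },
      { role: "user", content: JSON.stringify(names) },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}");
}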
Instead of trying to encapsulate all of it with pure prompt engineering, I would use RAG. Here's what I would do:

  1. upload and index your documents to llamaindex along with a "heuristics file":
Plain Text

  import fs from "node:fs/promises";
  import { Document } from "llamaindex";

  // read both files as UTF-8 strings (without an encoding, readFile
  // returns a Buffer, which JSON.stringify would mangle)
  const products = await fs.readFile("products.json", "utf-8");
  const heuristics = await fs.readFile("heuristics.txt", "utf-8");

  const productsDoc = new Document({ text: products });
  const heuristicsDoc = new Document({ text: heuristics });


  2. the heuristics file covers nuances that an LLM would probably miss, for example:
Plain Text
500ml means medium
cherries and gummy cherries are the same thing
croissants and mini croissants are not the same thing
7days is a brand name
coke, coca cola, COCA COLA, etc. are all the same thing 
...

you don't have to list every single name, just the easily confused ones, like the ones you listed here

  3. now your prompt can be much simpler, focusing on the format of output you want
you should be able to query it like this:

Plain Text
import { VectorStoreIndex } from "llamaindex";

const index = await VectorStoreIndex.fromDocuments([productsDoc, heuristicsDoc]);

const queryEngine = index.asQueryEngine();

const response = await queryEngine.query({
  query: "how many haribo gummy cherries are there?",
});


^ if your inventory numbers are correct, and the LLM knows which ones are the same, it should get this answer right. your prompt would then just focus on turning the output into JSON
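if you want the JSON straight from the query engine, you can ask for it in the query itself. a rough sketch (the exact wording is up to you):

Plain Text
const grouped = await queryEngine.query({
  query:
    'Group every product name into prototypes and return only a JSON object ' +
    'shaped like {"PROTOTYPE NAME": ["variation", ...]}. No other text.',
});

console.log(grouped.toString()); // the raw LLM answer, hopefully valid JSON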
Thank you for such a quick response, I wasn't expecting that!

This is really helpful!

I like your approach, even though I am not entirely sure I understand what RAG is yet.

In your example, the response to the query is indeed heuristic.

Instead, though, the way I want to utilize the model is as a tool that will ultimately give me my data in groups, just like in this example:

Plain Text
Given:

"Lay's Classic Chips", "LAYS original chips", "lay Chips family size", "lay's salt&vinegar chips", "lays plain chips", "LAYS Salted Chips", "lays salt n vinegar chips", "lays only salt chips"

You would return: 

{
  "LAYS CLASSIC CHIPS": [
    "Lay's Classic Chips",
    "LAYS original chips",
    "lay Chips family size"
  ],
  "LAYS SALT & VINEGAR CHIPS": [
    "lay's salt&vinegar chips",
    "lays salt n vinegar chips"
  ],
  "LAYS SALTED CHIPS": [
    "lays plain chips",
    "LAYS Salted Chips",
    "lays only salt chips"
  ]
}


Meaning, I want it to give me responses with grouped products, and after a certain point each response should somehow take all previous responses into account.
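To make that concrete, the "consider all previous responses" part could be a merge step like this (a rough sketch; `mergeGroups` is a hypothetical helper, not code I have):

Plain Text
type Groups = Record<string, string[]>;

// fold a new response into the running groups, so later batches can be
// shown what has already been classified
function mergeGroups(existing: Groups, incoming: Groups): Groups {
  const merged: Groups = { ...existing };
  for (const [prototype, variations] of Object.entries(incoming)) {
    const seen = new Set(merged[prototype] ?? []);
    for (const v of variations) seen.add(v);
    merged[prototype] = [...seen];
  }
  return merged;
}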
yep im leaving that part up to you πŸ˜„
RAG is basically the idea that you get the LLM to retrieve from a file
so instead of going back-and-forth with an LLM with prompts, you can say "everything in this file: you know"
hm okay, I get it. So I could possibly update the files recursively with the classifications that have already happened, in order to "know" what we got so far.
I am worried, though, that expecting responses that are plain JSON data is not the intended use of an LLM, so maybe that's why it would not be effective? I am not sure if that makes sense.
yep, just restart the app after you change the file and it will reindex the new info
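as a sketch, that reindex step could look like this after you update the file (assuming the productsDoc from the earlier snippet is still around):

Plain Text
// re-read the heuristics file after appending the latest groupings,
// then rebuild the index so the next query "knows" them
const heuristics = await fs.readFile("heuristics.txt", "utf-8");
const heuristicsDoc = new Document({ text: heuristics });
const index = await VectorStoreIndex.fromDocuments([productsDoc, heuristicsDoc]);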

yeah, in your case i would use llamaindex with either GPT-4 or, if you don't want to pay for OpenAI, use ollama and choose a fine-tuned code-oriented model from their library (https://ollama.com/library). I don't think it matters that much given the relatively simple nature of the JSON you want; most "coder" models will nail this part.

there's another way that's more advanced using ranking etc. but i bet it would get pretty close without it, considering how simple your data is
it's similar to checking spelling/grammar; there are 3 main ways i can think of:
  • tell the LLM to fix it
  • upload a reference file for the LLM to fact check
  • sort/rank results from the index, returning the top 1 or 3 etc. (see the sketch below)
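the third option would look roughly like this in llamaindex (the retriever API differs a bit between versions, so treat this as a sketch):

Plain Text
import { MetadataMode } from "llamaindex";

const retriever = index.asRetriever();
retriever.similarityTopK = 3; // only the 3 closest entries reach the LLM

const hits = await retriever.retrieve({ query: "haribo gummy cherries" });
for (const hit of hits) {
  console.log(hit.score, hit.node.getContent(MetadataMode.NONE));
}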
That is really helpful thank you.

Do you suggest any models that I can run locally without waiting hours for a response? I have an RTX 3060 and an Apple Silicon M1 Pro available. The Apple Silicon is faster, but maybe I should try a lighter model in general. Do you think Phi is worth it, or is it too weak for something like this?
phi isn't that good at anything IME
i almost always use mistral, and lately get pretty far by just uploading docs to fill in gaps lol
i made this video last night to demonstrate how to teach stuff like inventory to an agent
I also tried Mistral, but it is too heavy for my hardware. Do you run it locally?
that is really cool
yep i run it on a pretty standard gaming PC i built in 2018, 16GB of vram
and how long does it usually take for it to answer?
1-3 seconds usually
a simple "hi" prompt takes over 15 minutes for me...
maybe I am doing something wrong?
hm how are you running it
through ollama cli?
I think I run dolphin-mixtral-8x7b so I guess these two are not the same haha
yeah, that one's bigger, my computer would probably hang on that one forever
mistral is 7B, and latest is 4GB
Okay makes sense now! Thank you so much! Good luck with your projects!