Hello everyone, I am new to developing, fine-tuning, and training LLMs. For the last week I have been going through the LlamaIndex and OpenAI docs. I would really like to discuss my problem and hear opinions from more experienced people.
The problem I want to solve is, I think, a type of classification. We have a large database of products created by different stores. Each store can create products and is free to use whatever name it wants. Some products may be duplicated across stores under slightly or entirely different names, which a human would still be able to identify and match. For example, a Coca Cola 500ml in one store could be named Coca Cola medium in another.
In my opinion LLMs are the best way to recognize such variations. (I would like to hear why this might not be a good idea.)
I am trying to figure out the best way to provide a model with my huge data collection (about 50k tokens according to the OpenAI token calculator), instruct it accordingly, and guide it to return the entire collection, this time grouped. Basically, I want to create one product prototype for each product across the stores and list all of its name variations below it.
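To make the target output concrete, here is a minimal sketch of the grouped shape I have in mind. The rows and the name-to-prototype mapping are made up for illustration (only the Coca Cola pair comes from my example above), and the mapping stands in for whatever the model would return.

```python
from collections import defaultdict

# Hypothetical (store, product name) rows, made up for illustration.
rows = [
    ("store_a", "Coca Cola 500ml"),
    ("store_b", "Coca Cola medium"),
    ("store_a", "Sprite 330ml"),
]

# Hypothetical name -> prototype mapping, standing in for the model's answer.
assignments = {
    "Coca Cola 500ml": "Coca Cola 500ml",
    "Coca Cola medium": "Coca Cola 500ml",
    "Sprite 330ml": "Sprite 330ml",
}


def group_by_prototype(rows, assignments):
    """Return {prototype: [{"store": ..., "name": ...}, ...]}."""
    grouped = defaultdict(list)
    for store, name in rows:
        grouped[assignments[name]].append({"store": store, "name": name})
    return dict(grouped)


grouped = group_by_prototype(rows, assignments)
for prototype, variations in grouped.items():
    print(prototype, variations)
```

Each key is one product prototype and each value lists the store-specific variations that belong to it.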
Any ideas on how I should proceed?
I have already experimented with some code that gives promising results, but it is not entirely reliable: because the data is so large, it gets split into smaller prompts, and the classification does not work as expected across them. I am thinking of feeding the result back to the model after each response (fine-tuning it), so that a product matching a previously found result gets included there.
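A rough sketch of that feed-back loop, under some assumptions: instead of actually fine-tuning, it simply passes the prototypes found so far into every later batch, which is a different and simpler technique. `group_incrementally`, `batched`, and `toy_classify` are hypothetical names, and `toy_classify` is only a placeholder for the real LLM call (it matches on the first two words of the name).

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def group_incrementally(names, classify, batch_size=50):
    """Group names batch by batch, carrying known prototypes forward.

    `classify(batch, known_prototypes)` must return {name: prototype}.
    """
    groups = {}
    for batch in batched(names, batch_size):
        known = list(groups)  # prototypes found in earlier batches
        labels = classify(batch, known)
        for name in batch:
            # A name the classifier skipped becomes its own prototype.
            prototype = labels.get(name, name)
            groups.setdefault(prototype, []).append(name)
    return groups


def toy_classify(batch, known):
    """Placeholder for the LLM call: match on the first two lowercase words."""
    seen = {" ".join(p.lower().split()[:2]): p for p in known}
    labels = {}
    for name in batch:
        key = " ".join(name.lower().split()[:2])
        # Reuse an existing prototype for this key, else register this name.
        labels[name] = seen.setdefault(key, name)
    return labels


names = ["Coca Cola 500ml", "Sprite 330ml", "Coca Cola medium"]
groups = group_incrementally(names, toy_classify, batch_size=2)
```

The point of the design is that a product arriving in a later prompt can still land in a group created by an earlier one, because every call sees the current list of prototypes.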
I would love to give more details if someone is interested.