
Updated 10 months ago

hello community ! I have a more generic

Hello community! I have a more generic question: our documentation is roughly 60% technical documentation and 40% more marketing-oriented documentation. Currently we're using a single index for everything.
Over the past few days I've been trying to fine-tune an embedding model on roughly 8k synthetic examples generated from the 60% technical documentation mentioned above. What seems to happen now is that the retriever tends to favor the marketing documents, which is not necessarily what I want.
My questions are these:
  1. Is it bad practice to keep all documentation in the same index?
  2. If (1) is not so terrible, does it make sense to increase the top-k and introduce a reranker?
  3. If (1) is indeed bad practice, what are the recommendations?
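For readers unfamiliar with the idea in question 2, here is a minimal sketch of "increase the top-k, then rerank", assuming nothing about the actual stack in use. The names `retrieve_then_rerank`, `overlap`, and `overlap_ratio` are hypothetical, and the two scoring functions are toy stand-ins for an embedding retriever and a reranker model.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    retrieve_score: Scorer,
    rerank_score: Scorer,
    top_k: int = 20,
    top_n: int = 5,
) -> List[str]:
    """Over-fetch top_k candidates with a cheap scorer, then keep the top_n
    of those candidates according to a second (usually costlier) scorer."""
    candidates = sorted(docs, key=lambda d: retrieve_score(query, d), reverse=True)[:top_k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]


def overlap(query: str, doc: str) -> float:
    """Toy first-stage score (stand-in for embedding similarity): shared tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Toy reranker (stand-in for a reranker model): shared tokens per doc token."""
    doc_tokens = set(doc.lower().split())
    return overlap(query, doc) / max(len(doc_tokens), 1)


docs = [
    "configure the retry timeout",
    "our product makes the retry timeout story simple for every team everywhere",
    "pricing plans",
]
result = retrieve_then_rerank(
    "retry timeout", docs, overlap, overlap_ratio, top_k=2, top_n=1
)
print(result)
```

In this toy data the first stage cannot separate the short technical line from the longer marketing line, and the second stage breaks the tie. Whether a real reranker would correct the bias described above is something to measure on your own queries.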
1 comment
  1. Yes, if it starts to get in the way of your work. No, if it doesn't bother you at all, which apparently isn't the case here.
  2. (sounds kinda hacky to me)
  3. To split tech and marketing materials, you can ask an LLM to classify each document as tech or marketing material as you feed the documents through it. Depending on the response, add the document to the tech index or the marketing index.
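A rough sketch of the routing in point 3, assuming the classifier returns a plain label. `classify_document` here is a hypothetical keyword-based stand-in for the LLM call (swap in your own), and the "indexes" are plain lists rather than a real vector store.

```python
from collections import defaultdict
from typing import Callable, Dict, List

VALID_LABELS = ("tech", "marketing")


def classify_document(text: str) -> str:
    """Hypothetical stand-in for an LLM classification call.
    Returns 'tech' if at least two technical keywords appear, else 'marketing'."""
    tech_terms = ("api", "install", "config", "error", "parameter")
    hits = sum(term in text.lower() for term in tech_terms)
    return "tech" if hits >= 2 else "marketing"


def route_documents(
    docs: List[str],
    classifier: Callable[[str], str] = classify_document,
) -> Dict[str, List[str]]:
    """Send each document to the index named by the classifier's label.
    Unexpected labels go to an 'unknown' bucket for manual review."""
    indexes: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        label = classifier(doc).strip().lower()
        if label not in VALID_LABELS:
            label = "unknown"
        indexes[label].append(doc)
    return dict(indexes)


docs = [
    "Install the package and set the timeout parameter in the config file.",
    "Our platform helps teams move faster and delight customers.",
]
indexes = route_documents(docs)
print(indexes)
```

At query time you would then pick which index to search, for example only the tech one for technical questions. The "unknown" bucket is there because an LLM may answer with something other than the two expected labels.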