Hello community! I have a fairly general question. Our documentation is roughly 60% technical and 40% more marketing-oriented, and we currently use a single index for everything.
Over the past few days I've been trying to fine-tune an embedding model on roughly 8k synthetic examples generated from the technical 60% mentioned above. What seems to happen now is that the retriever tends to favor marketing documents, which is not necessarily what I want.
My questions are:
1. Is it bad practice to keep all documentation in the same index?
2. If (1) is not so terrible, does it make sense to increase the top-k and introduce a reranker?
3. If (1) is indeed bad practice, what are the recommendations?
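
To make the second question concrete, here is a rough sketch of the flow I have in mind: pull a larger candidate set from the single index, then rerank it down to a small final list, with an optional filter on document type. Everything in it is a placeholder and not our real setup: the toy corpus, the `kind` metadata field, the bag-of-words `embed` function (standing in for the fine-tuned embedding model), and the term-overlap `rerank_score` (standing in for a real reranker such as a cross-encoder) are all assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; "kind" is an assumed metadata field on each document.
DOCS = [
    {"id": "t1", "kind": "technical",
     "text": "configure the api timeout and retry settings in the client"},
    {"id": "t2", "kind": "technical",
     "text": "install the client library and set the api key"},
    {"id": "m1", "kind": "marketing",
     "text": "our api platform helps teams ship faster with the best client experience"},
    {"id": "m2", "kind": "marketing",
     "text": "customers love the platform and the fast onboarding"},
]


def embed(text):
    """Toy bag-of-words vector; a stand-in for the real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, top_k, kind=None):
    """First stage: similarity search, optionally restricted to one doc kind."""
    pool = [d for d in docs if kind is None or d["kind"] == kind]
    q = embed(query)
    ranked = sorted(pool, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:top_k]


def rerank_score(query, doc):
    """Toy reranker: fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc["text"].lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def retrieve_and_rerank(query, docs, first_stage_k, final_k, kind=None):
    """Fetch a wide candidate set, then keep the best final_k after reranking."""
    candidates = retrieve(query, docs, first_stage_k, kind)
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:final_k]


if __name__ == "__main__":
    query = "api timeout settings"
    for doc in retrieve_and_rerank(query, DOCS, first_stage_k=4, final_k=2):
        print(doc["id"], doc["kind"])
```

The `kind` argument is the alternative I am weighing for the first and third questions: keeping one index but filtering on metadata at query time, instead of splitting into separate indexes.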