Question - I have observed that the similarity score difference between a valid answer and an invalid answer is not as large as I was expecting. For example, when I ask "what are the pricing plans" the vector search returns a score of 0.76, and when I ask "what are your birthday party plans" the similarity score is 0.67. I believe both of these questions mention "plans" - is that the reason there is so little difference between the similarity scores?
Similarity score thresholds are not absolute and are highly dependent on which embeddings model you're using. A score of 0.67 might mean "not very similar" with one embeddings model, while 0.67 might mean "very similar" with another, so you need to learn the specific threshold of the model you're using. Also consider introducing a reranker into your pipeline to improve retrieval quality.
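Here's a rough sketch of what I mean, using sentence-transformers with two example model names (swap in whatever models you're actually comparing) - the same query/doc pair will get noticeably different cosine scores from different models:

```python
from sentence_transformers import SentenceTransformer, util

# the "off-topic" query and a pricing doc from the example above
query = "what are your birthday party plans"
doc = "Our pricing plans start at $10/month for the basic tier."

# score the same pair with two different embedding models;
# the raw numbers are not comparable across models
for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    q_emb, d_emb = model.encode([query, doc])
    score = util.cos_sim(q_emb, d_emb).item()
    print(f"{name}: {score:.3f}")
```

The point is just that "0.67" only means something relative to the score distribution of that particular model, so you calibrate the threshold per model rather than reuse one magic number.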
Rerankers are mostly cross-encoder models. Without a reranker, you embed the docs and the query separately, ignoring any relation between them, and then compare those independently generated embeddings (which have already lost some meaning, because embeddings compress information). A cross-encoder instead takes two inputs together (the query and a candidate doc) and uses a classifier layer to score how relevant each candidate is to the original query, comparing pairs of (original query, similar chunk). Not sure if that makes sense.
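A minimal sketch of that second stage, using the CrossEncoder class from sentence-transformers (the query and chunks below are just made-up examples standing in for your retrieved results):

```python
from sentence_transformers import CrossEncoder

# hypothetical query plus candidate chunks returned by the vector search
query = "what are the pricing plans"
chunks = [
    "Our pricing plans start at $10/month for the basic tier.",
    "We offer birthday party packages for groups of 10 or more.",
]

# the cross-encoder sees query and chunk together in one forward pass,
# so it can model the interaction between them directly
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in chunks])

# re-order the retrieved chunks by the cross-encoder relevance score
for score, chunk in sorted(zip(scores, chunks), reverse=True):
    print(f"{score:.3f}  {chunk}")
```

The usual pattern is to keep the cheap bi-encoder vector search for recall (top 20-50 chunks) and only run the slower cross-encoder over that small candidate set.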
I used to use an open source embedding model, but running it locally was an additional cost. Are there any free or low-cost embedding APIs available that are better than ada?