Hi, I would like to understand the cosine similarity scores. I tried to replicate them using a cosine similarity function and a dot product, but I get different numbers for the same chunk of text.
In the image, scores_llama is the similarity score of each of 10 different nodes, read from node.score.
llama_cosine and llama_dot are calculated using simple NumPy functions:
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(A, B):
    return dot(A, B) / (norm(A) * norm(B))

def dot_product(A, B):
    return dot(A, B)
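For reference, these two helpers return the same number only when both vectors have unit length, so comparing them on one pair can hint at whether the embeddings are already normalized. A small sketch with made-up vectors (`a` and `b` are illustrative values, not real embeddings):

```python
import numpy as np

# Illustrative vectors only; both have norm 5.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity divides the dot product by the product of the norms.
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The raw dot product is not scaled, so it differs for non-unit vectors.
raw_dot = float(np.dot(a, b))

# After scaling both vectors to unit length, the dot product matches the cosine.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
unit_dot = float(np.dot(a_unit, b_unit))

print(cosine, raw_dot, unit_dot)
```

If the dot product and the cosine of a real embedding pair differ noticeably, the embeddings are probably not unit-normalized.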
Both the cosine similarity and the dot product values differ from the llama-index score (I have not changed the default similarity method).
According to the code they should be the same, but they are not:
def similarity(
    embedding1: EMB_TYPE,
    embedding2: EMB_TYPE,
    mode: SimilarityMode = SimilarityMode.DEFAULT,
) -> float:
    """Get embedding similarity."""
    if mode == SimilarityMode.EUCLIDEAN:
        # Using -euclidean distance as similarity to achieve same ranking order
        return -float(np.linalg.norm(np.array(embedding1) - np.array(embedding2)))
    elif mode == SimilarityMode.DOT_PRODUCT:
        product = np.dot(embedding1, embedding2)
        return product
    else:
        product = np.dot(embedding1, embedding2)
        norm = np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
        return product / norm
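As a sanity check, the final `else` branch above can be compared directly with the hand-written helper. The sketch below copies that branch into a standalone function (`default_branch` is a name made up here, and the vectors are hypothetical stand-ins for a query embedding and a node embedding). It suggests the two formulas agree whenever they receive the same inputs, so a mismatch would more likely come from the vectors being compared than from the formula:

```python
import numpy as np

def cosine_similarity(a, b):
    """Hand-written cosine similarity, equivalent to the helper in the post."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def default_branch(embedding1, embedding2):
    """Standalone copy of the final `else` branch of the quoted function."""
    product = np.dot(embedding1, embedding2)
    norm = np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
    return float(product / norm)

# Hypothetical stand-ins for real embeddings (assumption, not actual data).
query_emb = [1.0, 2.0, 3.0]
node_emb = [2.0, 1.0, 0.5]
other_emb = [0.0, 1.0, 0.0]  # a different vector, e.g. from different input text

# Same inputs: both formulas give the same value.
same_inputs_agree = bool(
    np.isclose(cosine_similarity(query_emb, node_emb),
               default_branch(query_emb, node_emb))
)

# Different inputs: the values diverge even though the formula is unchanged.
different_inputs_agree = bool(
    np.isclose(cosine_similarity(query_emb, node_emb),
               default_branch(query_emb, other_emb))
)

print(same_inputs_agree, different_inputs_agree)
```

One way to use this would be to pass in the exact query and node embeddings behind a given node.score and see whether the recomputed value matches.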
I would like to understand what causes this difference.