
Updated 2 years ago

Hi, I would like to understand the cosine similarity scores

At a glance

The community member is trying to understand why the cosine similarity scores they calculated using numpy functions differ from the scores obtained using the llama-index library. They have replicated the code and used the same embedding model and numpy versions, but the results are not the same.

The community members discuss potential issues, such as accessing the correct node vectors and metadata, and provide suggestions on how to investigate the problem further. They mention that the difference in scores may not be due to rounding errors and could be significant, especially when dealing with highly similar data.

One community member suggests that the original community member may be accessing the text for embedding incorrectly, since the llama-index library includes metadata by default. They provide a code snippet to check the text that the embedding model actually uses.

Another community member tries to replicate the code shared by the original community member but encounters an attribute error. They provide a correction for the code and suggest looking up the object in the codebase and using IDE typing hints to help debug the issue.

There is no explicitly marked answer in the comments.

Hi, I would like to understand the cosine similarity scores. I have tried to replicate the scores using a cosine similarity function and the dot product, but I am getting different numbers for the same chunk of text.
In the image: scores_llama is the similarity score from each of 10 different nodes, taken from node.score
llama_cosine and llama_dot are calculated using simple numpy functions
Plain Text
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(A, B):
    return dot(A, B) / (norm(A) * norm(B))

def dot_product(A, B):
    return dot(A, B)

The cosine similarity and dot product values differ from the llama-index score (I have not changed the default similarity method).
According to the code it should be the same, but it is not:
Plain Text
def similarity(
    embedding1: EMB_TYPE,
    embedding2: EMB_TYPE,
    mode: SimilarityMode = SimilarityMode.DEFAULT,
) -> float:
    """Get embedding similarity."""
    if mode == SimilarityMode.EUCLIDEAN:
        # Using -euclidean distance as similarity to achieve same ranking order
        return -float(np.linalg.norm(np.array(embedding1) - np.array(embedding2)))
    elif mode == SimilarityMode.DOT_PRODUCT:
        product = np.dot(embedding1, embedding2)
        return product
    else:
        product = np.dot(embedding1, embedding2)
        norm = np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
        return product / norm

Would like to understand the issue here
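For what it's worth, the default branch in the snippet above and a hand-rolled numpy cosine similarity are the same formula, so on identical input vectors they agree to floating-point precision. A quick sketch (the 1536-dim random vectors are a stand-in for ada-002 embeddings, which are that size) suggests any score gap must come from the inputs being different, not the math:

```python
import numpy as np

def cosine_similarity(A, B):
    # hand-rolled cosine similarity, as in the question above
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

def llama_default_similarity(embedding1, embedding2):
    # the SimilarityMode.DEFAULT branch, reproduced from the snippet above
    product = np.dot(embedding1, embedding2)
    norm = np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
    return product / norm

rng = np.random.default_rng(0)
a, b = rng.normal(size=1536), rng.normal(size=1536)  # stand-ins for ada-002 vectors

# identical arithmetic, so the two agree on the same pair of vectors
assert abs(cosine_similarity(a, b) - llama_default_similarity(a, b)) < 1e-12
```

If the formulas agree, the remaining suspect is that the two sides embedded different text for the "same" chunk, which is where the discussion below ends up.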
Attachment
image.png
7 comments
How did you run the replication? Did you access the same node vectors, or something else?

Saw your issue on GitHub. It's a pretty small difference, but large enough to not be rounding error.

Did you use the same numpy versions when replicating?
I accessed the nodes using response.source_nodes and used the same embedding model (text-embedding-ada-002) to get the vectors.
The numpy versions are the same.
The scores in the image may show only a small difference, but it is significant when most of our data is highly similar and we are trying to get accurate answers.
Here's another example where the cosine similarity for the top 10 nodes varies widely, and the top-k nodes are not the best nodes (answers are not optimal using llama-index's cosine similarity):
Attachment
image.png
When you embed, do you use node.text to get the text?

That's actually the wrong way: under the hood, metadata is included when sending data to the embedding model

By default all metadata is sent to the embedding model, but the attribute excluded_embed_metadata_keys excludes certain keys from your metadata
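To make that concrete, here is a hypothetical sketch of the idea — not llama-index's actual implementation, and the rendering format is an assumption — showing how prepending metadata changes the string that gets embedded, and how excluding keys restores the raw text:

```python
# Hypothetical illustration (NOT the real llama-index code): the text sent to
# the embedding model is the metadata rendered above the raw node text, minus
# any keys listed in excluded_keys (playing the role of excluded_embed_metadata_keys).
def content_for_embedding(text, metadata, excluded_keys=()):
    lines = [f"{k}: {v}" for k, v in metadata.items() if k not in excluded_keys]
    return "\n".join(lines + [text]) if lines else text

meta = {"file_name": "report.pdf", "page_label": "3"}

# Metadata included: this string differs from the raw text, so its embedding does too
print(content_for_embedding("Revenue grew 12%.", meta))

# All keys excluded: only the raw text is embedded
print(content_for_embedding("Revenue grew 12%.", meta,
                            excluded_keys=("file_name", "page_label")))
```

Embedding `node.text` alone reproduces only the second case, which would explain scores that don't match the retriever's.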

To check what text the embedding model uses, you can use this

Plain Text
from llama_index.schema import MetadataMode

print(node.get_content(metadata_mode=MetadataMode.EMBED))
I may or may not have spelled a few things wrong there LOL on my phone
I am trying to replicate the code shared, using the snippet below
Plain Text
from llama_index.schema import MetadataMode
for node in response.source_nodes:
    print(node.get_content(metadata_mode=MetadataMode.EMBED))

and getting the error below
Plain Text
AttributeError                            Traceback (most recent call last)
Cell In[157], line 3
      1 from llama_index.schema import MetadataMode
      2 for node in response.source_nodes:
----> 3     print(node.get_content(metadata_mode=MetadataMode.EMBED))

AttributeError: 'NodeWithScore' object has no attribute 'get_content'

What am I doing wrong?
whoops typo on my part, forgot you were using source nodes here

node.node.get_content(...)
Pro dev tip -- when attribute errors like this happen, I always just look up the object in the codebase
https://github.com/jerryjliu/llama_index/blob/3506143d5aedafa91437a4f4097bceb3a4c9ab6f/llama_index/schema.py#L334

IDE typing hints help a lot too
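A minimal illustration of why the extra .node hop is needed — these are hypothetical stand-in classes, not the real llama-index schema — the retriever hands back a wrapper pairing each node with its score, and get_content lives on the inner node:

```python
# Hypothetical stand-ins to illustrate the wrapping (not the real classes):
class TextNode:
    def __init__(self, text):
        self.text = text

    def get_content(self):
        return self.text

class NodeWithScore:
    # pairs a node with its retrieval score; defines no get_content of its own,
    # hence the AttributeError when calling it on the wrapper directly
    def __init__(self, node, score):
        self.node = node
        self.score = score

nws = NodeWithScore(TextNode("hello"), 0.87)
print(nws.node.get_content())  # prints "hello" -- go through .node first
```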