Similarity cutoff

YYasmine

Hey, when I add response_synthesizer to RetrieverQueryEngine to filter chunks by similarity_cutoff, I always get None. I checked the results of retriever.retrieve(question), and chunks are always returned.

30 comments

LLogan M

Looking into this right now, forgot to answer this yesterday lol

YYasmine

It's okay, thanks! 😅

LLogan M

Hmm, this appears to be working for me, but using the default local index.

Maybe I need to set up weaviate for a proper test

LLogan M


from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.indices.postprocessor import SimilarityPostprocessor

documents = SimpleDirectoryReader("./paul_graham").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.8)
    ]
)

retriever = index.as_retriever()

# set a breakpoint to inspect
import pdb
pdb.set_trace()
nodes = retriever.retrieve("What did the author do growing up?")

response = query_engine.query("What did the author do growing up?")

print("Done!")

YYasmine

Hey, I tested your code and it's working, except that embedding=None
So yeah, I think it's related to the other vector stores (including weaviate), not the default one
I put the tests I did here : https://github.com/YasmineMh/Llamaindex-similarity/tree/main

YYasmine

not related question, what is the difference between VectorStoreIndex and GPTVectorStoreIndex ? 😅

LLogan M

They are the same haha, we just recently dropped the gpt prefix (but need to maintain backwards compatibility)

LLogan M

Thanks! I will check this out

YYasmine

thanks!

YYasmine

Hey, I was wondering if you had a chance to take a look at this?

LLogan M

Not yet 😦 It's definitely still on my list though!

YYasmine

I appreciate it, thanks!

LLogan M

ok, was finally able to find time to run this haha and indeed, both the retriever and the query set score=None at some point... but! Now I can at least step through the code and figure out why 🙂

LLogan M

ok, two issues:

1. LlamaIndex removes the embedding vector when converting the weaviate result back into the Node class
2. The score is None because weaviate isn't returning the score (and therefore, the similarity cutoff also won't function well)

1 is easy enough to fix. For 2, I'll see if weaviate can return the similarity score
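
To illustrate issue 2, here is a minimal sketch in plain Python of a similarity cutoff filter. It is not the LlamaIndex source: `ScoredChunk` and `apply_similarity_cutoff` are hypothetical names, and the sketch assumes that a chunk with no score is treated as failing the cutoff. Under that assumption, a vector store that returns no scores leaves nothing for the query step, which would match the None response reported above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved chunk with an optional score."""
    text: str
    score: Optional[float]


def apply_similarity_cutoff(chunks: List[ScoredChunk], cutoff: float) -> List[ScoredChunk]:
    """Keep only chunks whose score is present and at least `cutoff`."""
    kept = []
    for chunk in chunks:
        # Assumption: a missing score cannot pass the cutoff, so the chunk is dropped.
        if chunk.score is None:
            continue
        if chunk.score >= cutoff:
            kept.append(chunk)
    return kept


# The vector store returned chunks, but without scores.
chunks_without_scores = [ScoredChunk("a", None), ScoredChunk("b", None)]
# The same chunks once the store reports scores.
chunks_with_scores = [ScoredChunk("a", 0.85), ScoredChunk("b", 0.4)]

filtered_without = apply_similarity_cutoff(chunks_without_scores, 0.8)
filtered_with = apply_similarity_cutoff(chunks_with_scores, 0.8)

print(len(filtered_without))            # every chunk is dropped
print([c.text for c in filtered_with])  # only chunks at or above the cutoff remain
```

This also explains why retriever.retrieve(question) can return chunks while the filtered query returns nothing: retrieval itself does not depend on the score being populated.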

LLogan M

PR is out @Yasmine https://github.com/jerryjliu/llama_index/pull/6512

YYasmine

Thank you so much!
for 1. I think it's the default behaviour not just for weaviate
this PR is for both 1. and 2. ?

YYasmine

for similarity, maybe we need to calculate it as follows? https://weaviate.io/developers/weaviate/more-resources/faq#q-how-do-i-get-the-cosine-similarity-from-weaviates-certainty

LLogan M

This PR fixes both actually!

LLogan M

The only thing to note is that their default similarity metric is a little different than ours. In my testing, the similarity scores were a little lower (~0.2 for the classic paul graham queries)

YYasmine

Thank you!

YYasmine

Hey, I tested the similarity and I think we need to update the score. I commented about it here: https://github.com/jerryjliu/llama_index/pull/6512

YYasmine

I think if you test with the new score you'll get as well ~0.2 for the classic paul graham queries

YYasmine

can you please test with 1-distance and let me know?

LLogan M

hmmm, that kind of makes sense! Since 0 means identical (whooops, missed that)

But on the weaviate docs, they say that distance can range from 0-2? I wonder how that works lol

Judging by the definition, 1 - distance makes sense though


LLogan M

should be (1 - distance) / 2 ..... I think

YYasmine

actually, 1 - distance/2 is certainty (distance normalised to a 0-1 scale), and from here https://weaviate.io/developers/weaviate/more-resources/faq#q-how-do-i-get-the-cosine-similarity-from-weaviates-certainty cosine_sim = 2*certainty - 1. If you replace certainty there with 1 - distance/2, you get 1 - distance (this is my interpretation tho 😅)
so, in my opinion there are two options:

1. use certainty instead of distance, and the score will be 2 * certainty - 1
2. use distance (as you did), and the score will be 1 - distance
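
A quick way to check that the two options agree is to run the arithmetic. The sketch below only encodes the relations quoted in this thread (certainty = 1 - distance/2 and cosine_sim = 2*certainty - 1); the function names are made up for illustration and nothing here calls Weaviate or LlamaIndex.

```python
import math


def certainty_from_distance(distance: float) -> float:
    """Certainty as described above: distance normalised to a 0-1 scale."""
    return 1 - distance / 2


def cosine_from_certainty(certainty: float) -> float:
    """Option 1: cosine similarity recovered from certainty."""
    return 2 * certainty - 1


def cosine_from_distance(distance: float) -> float:
    """Option 2: cosine similarity taken directly from the distance."""
    return 1 - distance


# Distances across the 0-2 range mentioned in the thread.
distances = [0.0, 0.5, 1.0, 1.5, 2.0]

for d in distances:
    via_certainty = cosine_from_certainty(certainty_from_distance(d))
    via_distance = cosine_from_distance(d)
    # Both routes should give the same similarity score.
    assert math.isclose(via_certainty, via_distance, abs_tol=1e-12)
    print(d, via_certainty, via_distance)
```

Algebraically, 2 * (1 - distance/2) - 1 simplifies to 1 - distance, so a distance of 0 (identical vectors) maps to a score of 1, and the 0-2 distance range maps onto the usual cosine range of 1 down to -1.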

LLogan M

ohhhh ok, that makes more sense

LLogan M

ok, I can switch it to 1 - distance then

LLogan M

https://github.com/jerryjliu/llama_index/pull/6545

Will merge once the checks are done. Sorry about the confusion there lol

YYasmine

no problem, thank you !
