
Updated 2 years ago

Similarity cutoff

Hey, when I add response_synthesizer to RetrieverQueryEngine to filter chunks by similarity_cutoff, I always get None. I checked the results of retriever.retrieve(question), and chunks are always returned there.
Looking into this right now, forgot to answer this yesterday lol
It's okay, thanks! 😅
Hmm, this appears to be working for me, but using the default local index.

Maybe I need to set up weaviate for a proper test
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.indices.postprocessor import SimilarityPostprocessor

documents = SimpleDirectoryReader("./paul_graham").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.8)
    ]
)

retriever = index.as_retriever()

# set a breakpoint to inspect
import pdb
pdb.set_trace()
nodes = retriever.retrieve("What did the author do growing up?")

response = query_engine.query("What did the author do growing up?")

print("Done!")
Hey, I tested your code and it's working except for embedding=None
So yeah, I think it's related to the other vector stores, including weaviate, not the default one
I put the tests I did here : https://github.com/YasmineMh/Llamaindex-similarity/tree/main
Unrelated question: what is the difference between VectorStoreIndex and GPTVectorStoreIndex? 😅
They are the same haha, we just recently dropped the gpt prefix (but need to maintain backwards compatibility)
Thanks! I will check this out
Hey, I was wondering if you had a chance to take a look at this?
Not yet 😦 It's definitely still on my list though!
I appreciate it, thanks!
ok, was finally able to find time to run this haha and indeed, both the retriever and the query set the score=None at some point... but! Now I can at least step through the code and figure out why 🙂
ok two issues

  1. LlamaIndex removes the embedding vector when converting the weaviate result back into the Node class
  2. The score is none, because weaviate isn't returning the score (and therefore, the similarity cutoff also won't function well)
1 is easy enough to fix. For 2, I'll see if weaviate can return the similarity score
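To illustrate issue 2, here is a minimal sketch in plain Python (not LlamaIndex's actual code) of why a missing score defeats a similarity cutoff. ScoredChunk and apply_similarity_cutoff are hypothetical names made up for this example, and the assumption is that a chunk whose score is None cannot pass a threshold comparison and is therefore dropped:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved chunk with an optional score."""
    text: str
    score: Optional[float]


def apply_similarity_cutoff(chunks: List[ScoredChunk], cutoff: float) -> List[ScoredChunk]:
    """Keep only chunks whose score is present and at least `cutoff`.

    Assumption for this sketch: a chunk with score=None is treated as
    failing the cutoff, since there is nothing to compare against.
    """
    return [c for c in chunks if c.score is not None and c.score >= cutoff]


# The retriever returns chunks, but the vector store gave back no scores.
unscored = [ScoredChunk("chunk a", None), ScoredChunk("chunk b", None)]
# With real scores, the cutoff behaves as intended.
scored = [ScoredChunk("chunk a", 0.91), ScoredChunk("chunk b", 0.42)]

kept_unscored = apply_similarity_cutoff(unscored, cutoff=0.8)
kept_scored = apply_similarity_cutoff(scored, cutoff=0.8)
```

Under that assumption, every unscored chunk is filtered out, so the synthesizer has nothing to work with even though retrieval itself returned chunks.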
Thank you so much!
For 1., I think it's the default behaviour, not just for weaviate
Is this PR for both 1. and 2.?
This PR fixes both actually!
The only thing to note is that their default similarity metric is a little different than ours. In my testing, the similarity scores were a little lower (~0.2 for the classic paul graham queries)
Hey, I tested the similarity and I think we need to update the score. I commented about it here: https://github.com/jerryjliu/llama_index/pull/6512
I think if you test with the new score you'll get as well ~0.2 for the classic paul graham queries
can you please test with 1-distance and let me know?
hmmm, that kind of makes sense! Since 0 means identical (whooops, missed that)

But on the weaviate docs, they say that distance can range from 0-2? I wonder how that works lol

Judging by the definition, 1 - distance makes sense though
should be (1 - distance) / 2 ..... I think
actually 1 - distance/2 is certainty (normalised distance on a scale of 0-1), and from here https://weaviate.io/developers/weaviate/more-resources/faq#q-how-do-i-get-the-cosine-similarity-from-weaviates-certainty cosine_sim = 2*certainty - 1. If you replace certainty there with 1 - distance/2, you get 1 - distance (this is my interpretation tho 😅)
so, in my opinion there are two options:
  • use certainty instead of distance and the score will be 2 * certainty - 1
  • use distance (as you did) and the score will be 1-distance
ohhhh ok that makes more sense
ok, I can switch it to 1 - distance then
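The arithmetic behind the two options can be checked with a small sketch. It assumes cosine distance in the 0-2 range mentioned above and uses the two relations quoted in the thread (certainty = 1 - distance/2 and cosine_sim = 2*certainty - 1); the function names are made up for this example and are not part of weaviate or LlamaIndex:

```python
def certainty_from_distance(distance: float) -> float:
    """Normalise a cosine distance in [0, 2] to a certainty in [0, 1]."""
    return 1.0 - distance / 2.0


def cosine_from_certainty(certainty: float) -> float:
    """Relation quoted from the weaviate FAQ: cosine_sim = 2*certainty - 1."""
    return 2.0 * certainty - 1.0


def cosine_from_distance(distance: float) -> float:
    """Direct form: substituting certainty = 1 - distance/2 gives 1 - distance."""
    return 1.0 - distance


# Both routes should agree for any distance in the 0-2 range.
for d in (0.0, 0.5, 1.0, 1.5, 2.0):
    via_certainty = cosine_from_certainty(certainty_from_distance(d))
    direct = cosine_from_distance(d)
    print(f"distance={d}: via certainty={via_certainty}, direct={direct}")
```

So a distance of 0 (identical vectors) maps to a similarity of 1, and a distance of 2 maps to -1, whichever option is used.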
https://github.com/jerryjliu/llama_index/pull/6545

Will merge once the checks are done. Sorry about the confusion there lol
no problem, thank you !