You could try running some metrics. I use Ollama locally; with its OpenAI-compatible API support you save a lot of money running observability metrics. I just got it working yesterday, so I haven't had time to measure the above.
import phoenix as px
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents
from phoenix.trace import DocumentEvaluations, SpanEvaluations

# Pull the traced Q&A pairs and retrieved documents out of Phoenix
queries_df = get_qa_with_reference(px.Client())
retrieved_documents_df = get_retrieved_documents(px.Client())

# Point the eval model at the local OpenAI-compatible endpoint
eval_model = OpenAIModel(
    api_key="ollama",
    base_url="http://192.168.0.109:1234/v1/",
    model="<model>",
)

hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

# Hallucination and QA correctness run over the question/answer spans
hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)

# Relevance runs over the retrieved documents
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

# Log everything back to Phoenix so it shows up in the UI
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
    DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df),
)
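For context, the snippet above assumes Phoenix is already running and your RAG app is instrumented so its traces land there; here's a minimal sketch of that setup, assuming you run Phoenix in-process with launch_app (use whatever instrumentation fits your stack):

# Setup sketch (assumption: Phoenix runs in-process and the RAG app has already
# been instrumented, so spans exist before the evals above have anything to score)
import phoenix as px

session = px.launch_app()  # starts the local Phoenix UI and trace collector
print(session.url)         # open this to browse spans and the logged evaluations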