The `SemanticSimilarityEvaluator` calculates a similarity score between embeddings of the generated answer and the reference answer.

```python
from llama_index.core.evaluation import SemanticSimilarityEvaluator

evaluator = SemanticSimilarityEvaluator()

response = "The sky is typically blue"
reference = """The color of the sky can vary depending on several factors, including time of day, weather conditions, and location. During the day, when the sun is in the sky, the sky often appears blue. This is because of a phenomenon called Rayleigh scattering, where molecules and particles in the Earth's atmosphere scatter sunlight in all directions, and blue light is scattered more than other colors because it travels as shorter, smaller waves. This is why we perceive the sky as blue on a clear day.
"""

result = await evaluator.aevaluate(
    response=response,
    reference=reference,
)
print("Score: ", result.score)
print("Passing: ", result.passing)  # default similarity threshold is 0.8
```
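The example above uses a bare `await`, which works directly in a notebook or other async context. If you are running it from a plain Python script, you can wrap the call in an async function and drive it with `asyncio.run`; a minimal sketch using the same default evaluator (the shorter reference text here is just for brevity):

```python
import asyncio

from llama_index.core.evaluation import SemanticSimilarityEvaluator


async def main() -> None:
    evaluator = SemanticSimilarityEvaluator()

    result = await evaluator.aevaluate(
        response="The sky is typically blue",
        reference="The sky often appears blue because of Rayleigh scattering.",
    )
    print("Score: ", result.score)
    print("Passing: ", result.passing)


if __name__ == "__main__":
    asyncio.run(main())
```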
The `SemanticSimilarityEvaluator` only uses `response` and `reference`; passing in `query` does not influence the evaluation. The `aevaluate` method calculates the similarity score between the `response` and `reference` text. The result is an `EvaluationResult` object, which includes the similarity score and a boolean indicating whether the score passes the default similarity threshold of 0.8.
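If the default 0.8 threshold is too strict or too lenient for your use case, it can be changed at construction time; a minimal sketch, using the `similarity_threshold` parameter from the constructor signature reproduced further below:

```python
from llama_index.core.evaluation import SemanticSimilarityEvaluator

# Require a closer match before `passing` is reported as True
strict_evaluator = SemanticSimilarityEvaluator(similarity_threshold=0.9)

result = await strict_evaluator.aevaluate(
    response="The sky is typically blue",
    reference="On a clear day the sky looks blue.",
)
print(result.score, result.passing)
```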
The `SemanticSimilarityEvaluator` in LlamaIndex works by comparing the embeddings of the generated answer and the reference answer. The embeddings are generated using a model specified when the `SemanticSimilarityEvaluator` is instantiated; if no model is specified, it uses a default model from the settings or context. A custom similarity function can also be passed to the `SemanticSimilarityEvaluator`; if no function is specified, it uses a default function based on the specified `SimilarityMode`. The default `SimilarityMode` is not explicitly mentioned in the provided sources, but it is typically cosine similarity for text embeddings.

```python
class SemanticSimilarityEvaluator(BaseEvaluator):
    def __init__(
        self,
        embed_model: Optional[BaseEmbedding] = None,
        similarity_fn: Optional[Callable[..., float]] = None,
        similarity_mode: Optional[SimilarityMode] = None,
        similarity_threshold: float = 0.8,
        service_context: Optional[ServiceContext] = None,
    ) -> None:
        self._embed_model = embed_model or embed_model_from_settings_or_context(
            Settings, service_context
        )

        if similarity_fn is None:
            similarity_mode = similarity_mode or SimilarityMode.DEFAULT
            self._similarity_fn = lambda x, y: similarity(x, y, mode=similarity_mode)
        else:
            if similarity_mode is not None:
                raise ValueError(
                    "Cannot specify both similarity_fn and similarity_mode"
                )
            self._similarity_fn = similarity_fn

        self._similarity_threshold = similarity_threshold
```
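To make the "typically cosine similarity" point concrete, here is what a hand-rolled cosine similarity passed in as `similarity_fn` could look like; a minimal NumPy sketch, not the library's internal implementation:

```python
import numpy as np

from llama_index.core.evaluation import SemanticSimilarityEvaluator


def cosine_similarity(x, y) -> float:
    """Cosine similarity between two embedding vectors."""
    x, y = np.asarray(x), np.asarray(y)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


# Note: similarity_mode must be left unset when a custom similarity_fn is
# given, per the ValueError raised in the constructor above.
evaluator = SemanticSimilarityEvaluator(similarity_fn=cosine_similarity)
```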
In the `aevaluate` method, the evaluator calculates the embeddings for the response and the reference, then computes the similarity score using the specified or default similarity function:

```python
async def aevaluate(
    self,
    query: Optional[str] = None,
    response: Optional[str] = None,
    contexts: Optional[Sequence[str]] = None,
    reference: Optional[str] = None,
    **kwargs: Any,
) -> EvaluationResult:
    del query, contexts, kwargs  # Unused

    if response is None or reference is None:
        raise ValueError("Must specify both response and reference")

    response_embedding = await self._embed_model.aget_text_embedding(response)
    reference_embedding = await self._embed_model.aget_text_embedding(reference)

    similarity_score = self._similarity_fn(response_embedding, reference_embedding)

    passing = similarity_score >= self._similarity_threshold
    return EvaluationResult(
        score=similarity_score,
        passing=passing,
        feedback=f"Similarity score: {similarity_score}",
    )
```
You can use the `HTMLNodeParser` to transform HTML text into BaseNodes:

```python
from bs4 import BeautifulSoup
from llama_index.core.node_parser import HTMLNodeParser

# Initialize the parser
parser = HTMLNodeParser()

# Your HTML text
text = "<html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"

# Parse the text into a BeautifulSoup object
soup = BeautifulSoup(text, "html.parser")

# Extract nodes from the soup
html_nodes = parser.get_nodes_from_soup(soup)

# Now html_nodes is a list of BaseNodes
```

The `get_nodes_from_soup` method extracts nodes from the BeautifulSoup object; the result is a list of BaseNodes.
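Because `HTMLNodeParser` is a node parser, it should also work through the generic `get_nodes_from_documents` interface used elsewhere in this section; a sketch, assuming you wrap the raw HTML in a `Document` first:

```python
from llama_index.core.node_parser import HTMLNodeParser
from llama_index.core.schema import Document

html = "<html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"

parser = HTMLNodeParser()

# Wrap the HTML string in a Document and let the parser extract nodes from it
nodes = parser.get_nodes_from_documents([Document(text=html)])
```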
You can use the `SentenceSplitter` to split text into chunks while respecting sentence boundaries; each chunk becomes a node. Here is an example:

```python
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import Document

# Your text
text = "Test test test"

# Create a document from your text
document = Document(text=text)

# Initialize the parser
parser = SentenceSplitter()

# Get nodes from the document
nodes = parser.get_nodes_from_documents([document])

# Now nodes is a list of BaseNodes
```

The `get_nodes_from_documents` method extracts nodes from the document; the result is a list of BaseNodes.
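`SentenceSplitter` also exposes chunking parameters that control how much text ends up in each node; a minimal sketch with illustrative values for `chunk_size` and `chunk_overlap`:

```python
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import Document

# Smaller chunks, with a little overlap carried between consecutive chunks
parser = SentenceSplitter(chunk_size=256, chunk_overlap=20)

document = Document(text="First sentence. Second sentence. Third sentence.")
nodes = parser.get_nodes_from_documents([document])
print(len(nodes))
```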
The `PairwiseComparisonEvaluator` in LlamaIndex is used to evaluate the quality of a response versus a "reference" response given a question. It does this by having a large language model (LLM) judge which response is better.

```python
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI

# Initialize the LLM
llm = OpenAI(temperature=0, model="gpt-4")

# Initialize the evaluator
evaluator = PairwiseComparisonEvaluator(llm=llm)

# Your question, response, and reference response
question = "What is the capital of France?"
response = "The capital of France is Paris."
reference_response = "France's capital is Paris."

# Evaluate the response
result = await evaluator.aevaluate(
    query=question,
    response=response,
    second_response=reference_response,
)
print("Score: ", result.score)
print("Passing: ", result.passing)  # default threshold is 0.5
```
The `aevaluate` method is used to evaluate the response. The result is an `EvaluationResult` object, which includes the score and a boolean indicating whether the score passes the default threshold of 0.5.

The score in the `PairwiseComparisonEvaluator` example is not necessarily low: the score represents the probability that the LLM judges the first response (the one being evaluated) to be better than the second response (the reference). A score of 0 means the LLM judges the second response to be better in all cases, while a score of 1 means the LLM judges the first response to be better in all cases.
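To make the interpretation concrete, here is a small hypothetical helper (the name and the tie band are illustrative, not part of LlamaIndex) that maps a pairwise score to a verdict:

```python
def interpret_pairwise_score(score: float) -> str:
    """Map a PairwiseComparisonEvaluator score to a human-readable verdict.

    The score is the probability that the LLM prefers the first (evaluated)
    response over the second (reference) response.
    """
    if score == 0.0:
        return "reference response judged better in all cases"
    if score == 1.0:
        return "evaluated response judged better in all cases"
    if abs(score - 0.5) <= 0.1:
        return "responses judged roughly equal"
    return (
        "evaluated response preferred" if score > 0.5 else "reference response preferred"
    )


print(interpret_pairwise_score(1.0))  # evaluated response judged better in all cases
```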