I think it's a bug. I'm using gpt-4-1106-preview, which accepts 128K tokens, but for some reason the evals are calling llama_index.embeddings.openai.aget_embedding (why?). That call goes to the OpenAI embeddings endpoint, whose model only accepts 8192 tokens, so the 128K LLM context doesn't help and it barfs on anything >8K:
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 400 Bad Request"
WARNING:llama_index.llms.openai_utils:Retrying llama_index.embeddings.openai.aget_embedding in 0.7456712300709634 seconds as it raised BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 8825 tokens (8825 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}.
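For anyone hitting the same wall, here is a minimal workaround sketch: clamp the text to the embedding model's window before it gets embedded. This assumes the evals are embedding raw response/node text; truncate_for_embedding and the 8192 limit are my own assumptions for text-embedding-ada-002, not a llama_index API.

```python
# Workaround sketch: truncate text to the embedding model's token window
# before it is sent to the OpenAI embeddings endpoint.
# Assumption: the failing model is text-embedding-ada-002 (8192-token limit,
# cl100k_base encoding). Adjust if your setup embeds with a different model.
import tiktoken

EMBED_TOKEN_LIMIT = 8192  # max context of the embedding model (assumed)
_encoding = tiktoken.get_encoding("cl100k_base")

def truncate_for_embedding(text: str, limit: int = EMBED_TOKEN_LIMIT) -> str:
    """Return `text` cut down to at most `limit` tokens."""
    tokens = _encoding.encode(text)
    if len(tokens) <= limit:
        return text
    return _encoding.decode(tokens[:limit])

# Example: the 8825-token prompt from the log above would be trimmed to
# 8192 tokens instead of triggering the 400 Bad Request.
```

Of course this just papers over the symptom; the real question is why the evals are embedding at all when the configured model is gpt-4-1106-preview.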