Benchmark Evaluation#66

Open
Negiiiin wants to merge 8 commits into main from static_evaluation

@Negiiiin Negiiiin commented Apr 14, 2026

PR Type

Feature

Short Description

Adds benchmark evaluation for both static benchmarks and our own generated benchmarks, using open-source models.



is_mcq = bool(re.search(r"(?im)^\s*options\s*:\s*$", str(task.get("input", ""))))
if is_mcq:
    answer_instruction = (

Should we add something like the following lines in the prompt?

  • "You are an expert multiple-choice question solver"
  • "The answer choice must be one of: A, B, C, D, or E".

No, that doesn't add much to the prompt.

@afkanpour afkanpour left a comment

@afkanpour reviewed 1 file and all commit messages, and made 8 comments.
Reviewable status: 1 of 33 files reviewed, 8 unresolved discussions (waiting on Negiiiin and saidul-islam98).


scripts/flatten_inspect_logs.py line 90 at r2 (raw file):

    accuracy = (num_correct / num_samples) if num_samples else 0.0
    # In this binary setting with grades only, we treat F1 as equal to accuracy.
    f1 = accuracy

This is very confusing. If we don't calculate F1 here, why set it equal to accuracy? This could cause problems if someone unaware of this assignment uses the output JSON.

Code quote:

f1 = accuracy
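
One way to address this, as a minimal sketch: report accuracy only and leave F1 explicitly unset, so a consumer of the output JSON cannot mistake it for a computed metric. The function name `summarize_grades` and the "C" grade for correct samples are assumptions for illustration (the quoted code only shows "I" for incorrect); the real script may differ.

```python
from typing import Any, Dict, List


def summarize_grades(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize graded rows without aliasing F1 to accuracy."""
    num_samples = len(rows)
    # Assumption: "C" marks a correct sample, mirroring "I" for incorrect.
    num_correct = sum(1 for row in rows if row.get("grade") == "C")
    accuracy = (num_correct / num_samples) if num_samples else 0.0
    return {
        "num_samples": num_samples,
        "num_correct": num_correct,
        "accuracy": accuracy,
        # F1 is not computed from grades alone, so it is reported as null
        # in the JSON output instead of silently equalling accuracy.
        "f1": None,
    }
```

Dropping the `f1` key entirely would work as well; keeping it as `None` preserves the output schema for existing readers.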

src/eval_stages/stage1_eval_execution.py line 42 at r2 (raw file):

def _inspect_judge_model(judge_llm: Dict[str, Any]) -> Union[str, InspectModel]:
    """Resolve judge for Inspect so it can use a different API base than the subject."""
    provider = str(judge_llm.get("provider", "openai"))

What will happen here if we can't use an OpenAI model, for example if we don't have quota?
What if we want to use a non-OpenAI judge model? Can we specify this in a config?

Code quote:

provider = str(judge_llm.get("provider", "openai"))
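
A rough sketch of what a config-driven judge could look like: require the provider and model name in the config instead of silently defaulting to "openai". The helper name `resolve_judge_model_name` and the `name` key are hypothetical, and the `provider/model` string format is assumed to be what the evaluation framework expects.

```python
from typing import Any, Dict


def resolve_judge_model_name(judge_llm: Dict[str, Any]) -> str:
    """Build a 'provider/model' identifier from the judge config.

    Hypothetical helper: the 'name' key and the returned string format
    are assumptions, not taken from the PR.
    """
    provider = judge_llm.get("provider")
    name = judge_llm.get("name")
    if not provider or not name:
        # Fail loudly instead of falling back to a hard-coded provider.
        raise ValueError(
            "judge_llm config must set both 'provider' and 'name'"
        )
    return f"{provider}/{name}"
```

With this shape, switching to a different judge is a config change, and a missing provider surfaces as an error at startup.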

src/eval_stages/stage1_local_eval_execution.py line 81 at r2 (raw file):

    num_incorrect = sum(1 for row in rows if row.get("grade") == "I")
    accuracy = (num_correct / num_samples) if num_samples else 0.0
    f1 = accuracy

This is very confusing. If we don't calculate F1 here, why set it equal to accuracy? This could cause problems if someone unaware of this assignment uses the output JSON.

Code quote:

f1 = accuracy

src/eval_stages/stage1_local_eval_execution.py line 105 at r2 (raw file):

        prompt = str(task["input"])

    is_mcq = bool(re.search(r"(?im)^\s*options\s*:\s*$", str(task.get("input", ""))))

This is not a robust way to detect an MCQ. Can we add a problem_type field to the problem JSON that marks a task as an MCQ?

Code quote:

is_mcq
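
A minimal sketch of the suggestion, assuming the task JSON gains a `problem_type` field whose value is "mcq" for multiple-choice tasks (both the field value and the helper name `is_mcq_task` are illustrative). The existing regex is kept only as a fallback for tasks that predate the field.

```python
import re
from typing import Any, Dict

# Existing heuristic: a line consisting solely of "Options:".
_OPTIONS_HEADER = re.compile(r"(?im)^\s*options\s*:\s*$")


def is_mcq_task(task: Dict[str, Any]) -> bool:
    """Prefer the explicit problem_type field; fall back to the heuristic."""
    problem_type = task.get("problem_type")
    if problem_type is not None:
        # Assumption: "mcq" is the agreed value for multiple-choice tasks.
        return str(problem_type).strip().lower() == "mcq"
    return bool(_OPTIONS_HEADER.search(str(task.get("input", ""))))
```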

src/eval_stages/stage1_local_eval_execution.py line 108 at r2 (raw file):

    if is_mcq:
        answer_instruction = (
            "\n\nReason briefly and do not repeat yourself. Stop immediately after the final "

Why should we ask the model to do this?

Code quote:

Reason briefly and do not repeat yourself.

src/eval_stages/stage1_local_eval_execution.py line 116 at r2 (raw file):

    else:
        answer_instruction = (
            "\n\nReason briefly and do not repeat yourself. Stop immediately after the final "

Same comment as above.

Code quote:

Reason briefly and do not repeat yourself.

src/eval_stages/stage1_local_eval_execution.py line 345 at r2 (raw file):

    if not text:
        return None
    m = re.search(r"[-+]?\d[\d,]*(?:\.\d+)?", text)

This doesn't seem to match .5
Can we replace it with
m = re.search(r"[-+]?(?:\d[\d,]*(?:\.\d+)?|\.\d+)", text)

Code quote:

m = re.search(r"[-+]?\d[\d,]*(?:\.\d+)?", text)
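
To make the difference concrete, here is a small comparison of the two patterns, assuming the dots in the proposed pattern are meant to be escaped. The helper `first_number` is only for illustration.

```python
import re
from typing import Optional

# Pattern currently in the PR: requires a leading digit.
ORIGINAL = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")
# Proposed pattern: also accepts numbers that start with a decimal point.
PROPOSED = re.compile(r"[-+]?(?:\d[\d,]*(?:\.\d+)?|\.\d+)")


def first_number(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return the first substring matched by the pattern, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(first_number(ORIGINAL, "answer is .5"))   # 5
print(first_number(PROPOSED, "answer is .5"))   # .5
print(first_number(PROPOSED, "1,234.56"))       # 1,234.56
```

Note that the original pattern does not fail on ".5"; it matches "5" and drops the decimal point, which would silently turn 0.5 into 5.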


@afkanpour afkanpour left a comment

@afkanpour reviewed 1 file and resolved 3 discussions.
Reviewable status: 2 of 33 files reviewed, 5 unresolved discussions (waiting on Negiiiin and saidul-islam98).

@afkanpour afkanpour left a comment

LGTM

@afkanpour reviewed 31 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on Negiiiin and saidul-islam98).
