Benchmark Evaluation by Negiiiin · Pull Request #66 · VectorInstitute/automated_capability_evaluation

Negiiiin · 2026-04-14T16:47:35Z

PR Type

Feature

Short Description

This PR adds benchmark evaluation for both static benchmarks and our own generated benchmarks, using open-source models.

Made-with: Cursor

saidul-islam98 · 2026-04-14T19:31:25Z

+
+    is_mcq = bool(re.search(r"(?im)^\s*options\s*:\s*$", str(task.get("input", ""))))
+    if is_mcq:
+        answer_instruction = (


Should we add something like the following lines in the prompt?

"You are an expert multiple-choice question solver"

"The answer choice must be one of: A, B, C, D, or E".

No, that doesn't add much to the prompt.

afkanpour

@afkanpour reviewed 1 file and all commit messages, and made 8 comments.
Reviewable status: 1 of 33 files reviewed, 8 unresolved discussions (waiting on Negiiiin and saidul-islam98).

scripts/flatten_inspect_logs.py line 90 at r2 (raw file):

    accuracy = (num_correct / num_samples) if num_samples else 0.0
    # In this binary setting with grades only, we treat F1 as equal to accuracy.
    f1 = accuracy

This is very confusing. If we don't calculate F1 here, why set it equal to accuracy? This could cause problems if someone who's unaware of this assignment uses the output JSON.

Code quote:

f1 = accuracy
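
One possible way to address this concern, as a minimal sketch rather than the PR's actual fix: report `f1` as `None` (serialized as `null`) whenever it is not really computed, so a downstream reader of the JSON cannot mistake it for a measured value. The helper name `summarize_grades` and the assumption that a grade of `"C"` marks a correct sample are illustrative and not taken from the PR.

```python
import json
from typing import Any, Dict, List


def summarize_grades(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize graded rows without inventing an F1 value (hypothetical helper)."""
    num_samples = len(rows)
    # Assumption: a grade of "C" marks a correct sample.
    num_correct = sum(1 for row in rows if row.get("grade") == "C")
    accuracy = (num_correct / num_samples) if num_samples else 0.0
    return {
        "num_samples": num_samples,
        "num_correct": num_correct,
        "accuracy": accuracy,
        # F1 is not computed from grades alone, so emit null instead of
        # silently copying the accuracy value.
        "f1": None,
    }


if __name__ == "__main__":
    rows = [{"grade": "C"}, {"grade": "I"}, {"grade": "C"}, {"grade": "C"}]
    print(json.dumps(summarize_grades(rows), indent=2))
```

Dropping the `f1` key entirely would work as well; keeping it as `null` preserves the output schema for any consumer that expects the field to exist.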

src/eval_stages/stage1_eval_execution.py line 42 at r2 (raw file):

def _inspect_judge_model(judge_llm: Dict[str, Any]) -> Union[str, InspectModel]:
    """Resolve judge for Inspect so it can use a different API base than the subject."""
    provider = str(judge_llm.get("provider", "openai"))

What will happen here if we can't use an openai model? For example, if we don't have quota?
What if we want to use a non-openai judge model? Can we specify this in a config?

Code quote:

provider = str(judge_llm.get("provider", "openai"))
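
To make the question concrete, here is a hedged sketch of a judge lookup that requires the provider to be set in the config instead of silently defaulting to `openai`. The helper name `resolve_judge_model_name`, the `name` key, and the example values are all assumptions for illustration. The `provider/model` string form is the convention Inspect uses for model names, but how the PR actually wires the judge is not shown here.

```python
from typing import Any, Dict


def resolve_judge_model_name(judge_llm: Dict[str, Any]) -> str:
    """Build a "provider/model" string from a judge config (hypothetical helper)."""
    provider = judge_llm.get("provider")
    # Assumption: the model identifier lives under a "name" key.
    name = judge_llm.get("name")
    if not provider or not name:
        # Fail loudly rather than falling back to a provider the user
        # may have no quota for.
        raise ValueError("judge_llm must set both 'provider' and 'name'")
    return f"{provider}/{name}"


if __name__ == "__main__":
    # Placeholder values, not real configuration from the repository.
    print(resolve_judge_model_name({"provider": "vllm", "name": "my-judge-model"}))
```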

src/eval_stages/stage1_local_eval_execution.py line 81 at r2 (raw file):

    num_incorrect = sum(1 for row in rows if row.get("grade") == "I")
    accuracy = (num_correct / num_samples) if num_samples else 0.0
    f1 = accuracy

This is very confusing. If we don't calculate F1 here, why set it equal to accuracy? This could cause problems if someone who's unaware of this assignment uses the output JSON.

Code quote:

f1 = accuracy

src/eval_stages/stage1_local_eval_execution.py line 105 at r2 (raw file):

        prompt = str(task["input"])

    is_mcq = bool(re.search(r"(?im)^\s*options\s*:\s*$", str(task.get("input", ""))))

This is not a robust way to detect an MCQ. Can we add a problem_type field to the problem JSON that indicates an MCQ?

Code quote:

is_mcq
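
A minimal sketch of what the suggested `problem_type` check could look like, assuming the field holds the string `"mcq"` for multiple-choice problems (that value, and the helper name `is_mcq_task`, are assumptions). The regex from the PR is kept only as a fallback for task files that predate the field.

```python
import re
from typing import Any, Dict

# Same pattern as in the PR: a line consisting only of "Options:".
_OPTIONS_HEADER = re.compile(r"(?im)^\s*options\s*:\s*$")


def is_mcq_task(task: Dict[str, Any]) -> bool:
    """Prefer an explicit problem_type; fall back to the text heuristic."""
    problem_type = task.get("problem_type")
    if problem_type is not None:
        # Assumption: "mcq" is the value used for multiple-choice problems.
        return str(problem_type).strip().lower() == "mcq"
    return bool(_OPTIONS_HEADER.search(str(task.get("input", ""))))


if __name__ == "__main__":
    print(is_mcq_task({"problem_type": "mcq", "input": "Pick one."}))
    print(is_mcq_task({"input": "Question?\nOptions:\nA. 1\nB. 2"}))
```

With an explicit field, a free-form question that happens to contain an "Options:" line is no longer misclassified, because the heuristic only runs when the field is absent.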

src/eval_stages/stage1_local_eval_execution.py line 108 at r2 (raw file):

    if is_mcq:
        answer_instruction = (
            "\n\nReason briefly and do not repeat yourself. Stop immediately after the final "

Why should we ask the model to do these?

Code quote:

Reason briefly and do not repeat yourself.

src/eval_stages/stage1_local_eval_execution.py line 116 at r2 (raw file):

    else:
        answer_instruction = (
            "\n\nReason briefly and do not repeat yourself. Stop immediately after the final "

Same comment as above.

Code quote:

Reason briefly and do not repeat yourself.

src/eval_stages/stage1_local_eval_execution.py line 345 at r2 (raw file):

    if not text:
        return None
    m = re.search(r"[-+]?\d[\d,]*(?:\.\d+)?", text)

This doesn't seem to match .5
Can we replace it with
m = re.search(r"[-+]?(?:\d[\d,]*(?:\.\d+)?|\.\d+)", text)

Code quote:

m = re.search(r"[-+]?\d[\d,]*(?:\.\d+)?", text)
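
The difference between the two patterns can be checked directly. The sketch below compares the original pattern with the suggested one, written with the literal dots escaped as `\.`; the helper `first_number` is a hypothetical name used only for this comparison.

```python
import re
from typing import Optional

# Pattern currently in the PR: requires a digit before any decimal point.
ORIGINAL = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")
# Suggested pattern: also accepts a bare leading decimal point such as ".5".
PROPOSED = re.compile(r"[-+]?(?:\d[\d,]*(?:\.\d+)?|\.\d+)")


def first_number(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return the first substring matched by pattern, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for text in ["x = .5", "-.5", "1,234.50"]:
        print(repr(text), first_number(ORIGINAL, text), first_number(PROPOSED, text))
```

On `"x = .5"` the original pattern skips the decimal point and returns `5`, which silently changes the value, while the suggested pattern returns `.5`. Both agree on inputs like `"1,234.50"`.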

afkanpour · 2026-04-15T14:45:59Z

+
+    is_mcq = bool(re.search(r"(?im)^\s*options\s*:\s*$", str(task.get("input", ""))))
+    if is_mcq:
+        answer_instruction = (


No, that doesn't add much to the prompt.

afkanpour

@afkanpour reviewed 1 file and resolved 3 discussions.
Reviewable status: 2 of 33 files reviewed, 5 unresolved discussions (waiting on Negiiiin and saidul-islam98).

afkanpour

@afkanpour reviewed 31 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on Negiiiin and saidul-islam98).

Negiiiin added 4 commits March 17, 2026 01:05

Added static benchmarks

80a7cd5

New Finance Benchmarks

3d13bd1

WIP

93cb82e

Made-with: Cursor

updated scripts

a294dd4

Negiiiin requested review from afkanpour and saidul-islam98 April 14, 2026 16:48

Removed unnecessary logic from evaluation step 1

287dbb2

saidul-islam98 reviewed Apr 14, 2026

afkanpour requested a review from saidul-islam98 April 15, 2026 14:45

afkanpour requested changes Apr 15, 2026

Negiiiin added 2 commits April 15, 2026 15:56

changed prompt

b8c479e

Made some changes

4543b0f

afkanpour reviewed Apr 16, 2026

afkanpour approved these changes Apr 16, 2026

FinKnow

6f73c9d

Benchmark Evaluation #66
Negiiiin wants to merge 8 commits into main from static_evaluation

Negiiiin commented Apr 14, 2026 (edited by afkanpour)

3 participants
