Add self-contrast framework, V5 tools baseline, and benchmark artifacts by Ali-Meh619 · Pull Request #65 · VectorInstitute/automated_capability_evaluation

Ali-Meh619 · 2026-03-27T18:05:46Z

PR Type

[Feature | Fix | Documentation | Other() ]
Feature

Short Description

Adds a new src/task_solver/self_contrast/ package with five runnable variants for benchmark evaluation: base self-contrast, single-agent baseline, tools-integrated self-contrast, tool-perspective self-contrast, and a new single-agent + tools baseline.

This PR also includes the XFinBench dataset snapshots used for evaluation, saved result files for V1-V5 runs, a comparison summary JSON, an interactive comparison dashboard, and a README that documents setup, execution, evaluation variants, and experimental results.

To keep the self-contrast package usable in environments without optional dependencies, this PR adds defensive import guards in src/task_solver/__init__.py and src/utils/__init__.py.
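A guarded optional import can be sketched as follows. This is a minimal illustration of the general pattern, not the code added in this PR; the helper names `optional_import` and `require` are hypothetical, and the actual guards in the two init files may be structured differently.

```python
"""Sketch of a guarded optional import. All helper names here are hypothetical."""
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package lands here instead of breaking the package import.
        return None


def require(module: Optional[ModuleType], name: str) -> ModuleType:
    """Fail late, with a clear message, only when the optional feature is used."""
    if module is None:
        raise ImportError(f"'{name}' is not installed; install it to use this feature.")
    return module


# Example (assumption): a package init could call optional_import("datasets")
# or optional_import("autogen") and export dependent helpers only when the
# result is not None.
```

With this shape, importing the package never fails because of a missing extra; the error surfaces only at the point where the optional feature is actually requested.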

Tests Added

Added tests/src/task_solver/self_contrast/test_solver.py.

The tests cover:

- skipping the contrast phase when all perspectives already agree locally
- retrying tool-enabled solving when the model response does not include a runnable code block on the first try
- merging numerically equivalent decimal and percentage answers into the same majority-vote bucket
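The agreement check and the decimal/percentage merging covered by these tests can be illustrated with a short sketch. The PR page does not show the solver's actual normalization rules, so everything below is an assumption: the function names, the rule that a trailing `%` divides the value by 100, and the rounding precision are all hypothetical.

```python
from collections import Counter
from typing import List, Optional


def normalize_answer(raw: str) -> Optional[float]:
    """Parse strings such as '0.05' or '5%' into a float (assumed rule: '%' means /100)."""
    text = raw.strip().replace(",", "")
    is_percent = text.endswith("%")
    if is_percent:
        text = text[:-1].strip()
    try:
        value = float(text)
    except ValueError:
        return None  # non-numeric answers are ignored by this sketch
    return value / 100.0 if is_percent else value


def majority_vote(answers: List[str], ndigits: int = 6) -> Optional[float]:
    """Return the most common answer after bucketing by rounded numeric value."""
    buckets: Counter = Counter()
    for raw in answers:
        value = normalize_answer(raw)
        if value is not None:
            buckets[round(value, ndigits)] += 1
    if not buckets:
        return None
    return buckets.most_common(1)[0][0]


def all_agree(answers: List[str], ndigits: int = 6) -> bool:
    """True when every answer parses and all fall into one bucket.

    A solver could use a check like this to skip the contrast phase.
    """
    values = [normalize_answer(a) for a in answers]
    if not values or any(v is None for v in values):
        return False
    return len({round(v, ndigits) for v in values}) == 1
```

Under these assumptions, `"0.05"` and `"5%"` count as two votes for the same bucket, so `majority_vote(["0.05", "5%", "0.07"])` selects the shared value rather than treating the three answers as a three-way tie.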

Introduce the self-contrast evaluation package with five runnable variants, dataset snapshots, tests, and comparison outputs so the experiments can be reviewed and reproduced from the repo. Add import guards in task solver and utils so the self-contrast code can run without optional autogen or datasets dependencies. Made-with: Cursor

Resolve the self-contrast pre-commit failures by normalizing tracked dataset and result artifacts, tightening typing in the model client and solver helpers, and making optional package exports degrade gracefully when extra evaluation dependencies are missing. Made-with: Cursor

Ali-Meh619 added 2 commits March 27, 2026 10:35

merge main into SC_ALI

e6b358a

Bring SC_ALI up to date with main before opening the self-contrast pull request. Preserve the self-contrast framework and keep optional utility imports guarded so the self-contrast package remains usable without extra evaluation dependencies. Made-with: Cursor

Ali-Meh619 requested a review from afkanpour March 27, 2026 18:05

afkanpour approved these changes Mar 27, 2026

Ali-Meh619 merged commit 369c994 into main Mar 27, 2026
2 checks passed

Ali-Meh619 deleted the SC_ALI branch March 27, 2026 19:20

Ali-Meh619 merged 3 commits into main from SC_ALI
