Add self-contrast framework, V5 tools baseline, and benchmark artifacts#65

Merged
Ali-Meh619 merged 3 commits into main from SC_ALI
Mar 27, 2026

@Ali-Meh619

PR Type

[Feature | Fix | Documentation | Other() ]
Feature

Short Description

Adds a new src/task_solver/self_contrast/ package with five runnable variants for benchmark evaluation: base self-contrast, single-agent baseline, tools-integrated self-contrast, tool-perspective self-contrast, and a new single-agent + tools baseline.

This PR also includes the XFinBench dataset snapshots used for evaluation, saved result files for V1-V5 runs, a comparison summary JSON, an interactive comparison dashboard, and a README that documents setup, execution, evaluation variants, and experimental results.

To keep the self-contrast package usable in environments without optional dependencies, this PR adds defensive import guards in src/task_solver/__init__.py and src/utils/__init__.py.
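A defensive import guard of the kind described above could look roughly like the sketch below. It is illustrative only, not the code in this PR: the helper name `optional_import` and the `HAS_DATASETS` flag are hypothetical, and the only assumption taken from the PR is that `datasets` is one of the optional dependencies.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed.

    Hypothetical helper: the real guards in this PR may be written inline
    with try/except instead of going through a function.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package lands here and the caller can degrade gracefully.
        return None


# Example: treat `datasets` as optional and record whether it is available.
datasets = optional_import("datasets")
HAS_DATASETS = datasets is not None
```

A package `__init__.py` could then export dataset-dependent helpers only when `HAS_DATASETS` is true, so importing the self-contrast code does not fail when the extra is absent.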

Tests Added

Added tests/src/task_solver/self_contrast/test_solver.py.

The tests cover:

- skipping the contrast phase when all perspectives already agree locally
- retrying tool-enabled solving when the model response does not include a runnable code block on the first try
- merging numerically equivalent decimal and percentage answers into the same majority-vote bucket
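The agreement check and the decimal/percentage merging covered by these tests can be sketched as follows. This is a minimal illustration under stated assumptions, not the solver's actual implementation: the function names `normalize_answer`, `majority_vote`, and `all_agree` are hypothetical, answers are assumed to arrive as plain strings, and rounding to six decimal places is an arbitrary choice for bucketing.

```python
from collections import Counter
from typing import List, Optional


def normalize_answer(raw: str) -> Optional[float]:
    """Map '5%' and '0.05' to the same float; return None if not numeric."""
    text = raw.strip().replace(",", "")
    is_percent = text.endswith("%")
    if is_percent:
        text = text[:-1].strip()
    try:
        value = float(text)
    except ValueError:
        return None
    if is_percent:
        value /= 100.0
    # Rounding is an assumed tolerance so near-identical floats share a bucket.
    return round(value, 6)


def majority_vote(answers: List[str]) -> Optional[float]:
    """Return the most common normalized answer, ignoring non-numeric ones."""
    values = (normalize_answer(a) for a in answers)
    buckets = Counter(v for v in values if v is not None)
    if not buckets:
        return None
    return buckets.most_common(1)[0][0]


def all_agree(answers: List[str]) -> bool:
    """True when every answer is numeric and falls into a single bucket.

    A solver could use a check like this to skip the contrast phase.
    """
    values = [normalize_answer(a) for a in answers]
    return None not in values and len(set(values)) == 1
```

With this sketch, `majority_vote(["0.05", "5%", "0.06"])` places the first two answers in one bucket, so the percentage and decimal forms count as two votes for the same value.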

Introduce the self-contrast evaluation package with five runnable variants, dataset snapshots, tests, and comparison outputs so the experiments can be reviewed and reproduced from the repo. Add import guards in task solver and utils so the self-contrast code can run without optional autogen or datasets dependencies.

Made-with: Cursor
Bring SC_ALI up to date with main before opening the self-contrast pull request. Preserve the self-contrast framework and keep optional utility imports guarded so the self-contrast package remains usable without extra evaluation dependencies.

Made-with: Cursor
@Ali-Meh619 requested a review from afkanpour on March 27, 2026, 18:05
Resolve the self-contrast pre-commit failures by normalizing tracked dataset and result artifacts, tightening typing in the model client and solver helpers, and making optional package exports degrade gracefully when extra evaluation dependencies are missing.

Made-with: Cursor
@Ali-Meh619 merged commit 369c994 into main on March 27, 2026
2 checks passed
@Ali-Meh619 deleted the SC_ALI branch on March 27, 2026, 19:20