Add self-contrast framework, V5 tools baseline, and benchmark artifacts #65
Merged
Introduce the self-contrast evaluation package with five runnable variants, dataset snapshots, tests, and comparison outputs so the experiments can be reviewed and reproduced from the repo. Add import guards in task solver and utils so the self-contrast code can run without optional autogen or datasets dependencies. Made-with: Cursor
Bring SC_ALI up to date with main before opening the self-contrast pull request. Preserve the self-contrast framework and keep optional utility imports guarded so the self-contrast package remains usable without extra evaluation dependencies. Made-with: Cursor
Resolve the self-contrast pre-commit failures by normalizing tracked dataset and result artifacts, tightening typing in the model client and solver helpers, and making optional package exports degrade gracefully when extra evaluation dependencies are missing. Made-with: Cursor
afkanpour
approved these changes
Mar 27, 2026
PR Type
[Feature | Fix | Documentation | Other() ]
Feature
Short Description
Adds a new src/task_solver/self_contrast/ package with five runnable variants for benchmark evaluation: base self-contrast, single-agent baseline, tools-integrated self-contrast, tool-perspective self-contrast, and a new single-agent + tools baseline.
This PR also includes the XFinBench dataset snapshots used for evaluation, saved result files for V1-V5 runs, a comparison summary JSON, an interactive comparison dashboard, and a README that documents setup, execution, evaluation variants, and experimental results.
To keep the self-contrast package usable in environments without optional dependencies, this PR adds defensive import guards in src/task_solver/__init__.py and src/utils/__init__.py.
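For reviewers unfamiliar with the pattern, here is a minimal sketch of what a guarded optional import can look like. It is not the code in this PR: the helper names `optional_import` and `require_optional` are hypothetical and exist only for illustration.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed.

    Hypothetical helper: the actual guards in this PR may be written
    as plain try/except blocks around each import.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package lands here instead of breaking the package import.
        return None


def require_optional(name: str, feature: str) -> ModuleType:
    """Fail late, with a clear message, only when the feature is used."""
    module = optional_import(name)
    if module is None:
        raise ImportError(f"{feature} needs the optional '{name}' package")
    return module
```

With this shape, importing the package never fails because of a missing extra; the error appears only when a code path that needs the extra dependency is actually called.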
Tests Added
Added tests/src/task_solver/self_contrast/test_solver.py.
The tests cover:
- skipping the contrast phase when all perspectives already agree locally
- retrying tool-enabled solving when the model response does not include a runnable code block on the first try
- merging numerically equivalent decimal and percentage answers into the same majority-vote bucket
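The agreement check and the majority-vote bucketing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the solver's actual implementation: `normalize_answer`, `majority_vote`, and `all_agree` are hypothetical names, and rounding to six decimal places is an arbitrary tolerance chosen for the example.

```python
from collections import Counter
from decimal import Decimal, InvalidOperation
from typing import List, Tuple


def normalize_answer(raw: str) -> str:
    """Map an answer string to a canonical bucket key (hypothetical helper)."""
    text = raw.strip().replace(",", "")
    is_percent = text.endswith("%")
    if is_percent:
        text = text[:-1].strip()
    try:
        value = Decimal(text)
        if is_percent:
            value = value / Decimal(100)
        # Assumed tolerance: round to 6 places so 0.05 and 0.0500 share a key.
        value = value.quantize(Decimal("0.000001")).normalize()
        return format(value, "f")
    except InvalidOperation:
        # Non-numeric answers are bucketed by their lowercased text.
        return raw.strip().lower()


def majority_vote(answers: List[str]) -> Tuple[str, Counter]:
    """Return the most common normalized answer and the full tally."""
    if not answers:
        raise ValueError("majority_vote needs at least one answer")
    counts = Counter(normalize_answer(a) for a in answers)
    winner, _ = counts.most_common(1)[0]
    return winner, counts


def all_agree(answers: List[str]) -> bool:
    """True when every answer falls into the same bucket."""
    return len({normalize_answer(a) for a in answers}) == 1
```

In this sketch, `"5%"` and `"0.05"` normalize to the same key, so they count as two votes for one answer, and a caller could skip the contrast phase whenever `all_agree` returns true for the per-perspective answers.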