fix: normalize TAP judge scores to consistent 1-10 scale #365
Merged
Marco Russo (marcorusso97) merged 2 commits on May 22, 2026
Problem
-------
TAP had two conflicting defaults for success_score_threshold:
- DEFAULT_TAP_CONFIG dict: 1 (designed for binary 0/1 judges)
- TapParams Pydantic model: 10 (designed for nuanced 1-10 judges)

Since the TAP JUDGE_SYSTEM_PROMPT scores on a 1-10 scale (where 1 is the minimum), any non-refusal response scored >= 1 and immediately triggered early stopping when the dict default (1) was used.

The pruning min_score was hardcoded to 1, which:
- On the 1-10 scale: kept every branch (no pruning effect)
- On the 0/1 scale: only kept already-jailbroken branches (too aggressive)

Solution
--------
- evaluation.py: Auto-detect binary judges (harmbench, harmbench_variant, jailbreakbench) via _judges_are_binary() with robust type inference from type, evaluator_type, and identifier fields. Normalize binary 0/1 scores to 1/10 in score_candidates() so all downstream logic (early stop, pruning, success check) works on a uniform 1-10 scale.
- config.py: Align DEFAULT_TAP_CONFIG dict default to 10, matching TapParams. Add min_judge_prune_score (default 3) to both dict and Pydantic model.
- generation.py: Read min_judge_prune_score from config (default 3) instead of hardcoded min_score=1. Align success_threshold fallback to 10.

Note: The default success_score_threshold is 10 (perfect jailbreak). Users can override via tap_params.success_score_threshold.
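The detection and normalization step described in the commit message could look roughly like the sketch below. It is not the code from this PR: the helper names, the dict-shaped judge config, and the clamping of non-binary scores are assumptions made for illustration. Only the three binary judge types, the three fields used for type inference, and the 0 -> 1 / 1 -> 10 mapping come from the text above.

```python
from typing import Iterable, Mapping

# Judge types the commit message lists as binary (0/1) scorers.
BINARY_JUDGE_TYPES = {"harmbench", "harmbench_variant", "jailbreakbench"}


def judge_type(judge: Mapping[str, object]) -> str:
    """Infer a judge's type from the first populated field.

    The dict-shaped judge config is a hypothetical stand-in.
    """
    for key in ("type", "evaluator_type", "identifier"):
        value = judge.get(key)
        if isinstance(value, str) and value:
            return value.strip().lower()
    return ""


def judges_are_binary(judges: Iterable[Mapping[str, object]]) -> bool:
    """True only if at least one judge exists and every judge is a known binary type."""
    judge_list = list(judges)
    return bool(judge_list) and all(
        judge_type(j) in BINARY_JUDGE_TYPES for j in judge_list
    )


def normalize_score(raw: float, binary: bool) -> int:
    """Map binary 0/1 scores to 1/10; clamp other scores into the 1-10 range."""
    if binary:
        return 10 if raw >= 1 else 1
    return max(1, min(10, int(round(raw))))
```

With this mapping, a binary judge's "not jailbroken" result (0) becomes the scale minimum (1) and a "jailbroken" result (1) becomes the scale maximum (10), so a threshold of 10 behaves the same for both kinds of judge.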
The coordinator's _default_goal_scorer uses success_threshold=0.5 by default, which is correct for a 0-1 scale but wrong for TAP's 1-10 normalized scale. A judge score of 1 (= 'no jailbreak') exceeds 0.5 and was incorrectly classified as SUCCESSFUL_JAILBREAK. TAP now passes success_score_threshold (default 10) so only scores >= 10 are classified as successful jailbreaks.
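The threshold mismatch can be reproduced with a small stand-in for the goal scorer. The function below is hypothetical and only mirrors the comparison described above (success when the score reaches the threshold). It is not the coordinator's actual _default_goal_scorer.

```python
def is_successful_jailbreak(score: float, success_threshold: float = 0.5) -> bool:
    """Hypothetical stand-in: a goal counts as jailbroken when score >= threshold."""
    return score >= success_threshold


# On the normalized 1-10 scale, a score of 1 means "no jailbreak".
# With the 0-1 default threshold it is still classified as a success:
print(is_successful_jailbreak(1))                         # True (the bug)

# Passing TAP's success_score_threshold (default 10) fixes the classification:
print(is_successful_jailbreak(1, success_threshold=10))   # False
print(is_successful_jailbreak(10, success_threshold=10))  # True
```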
Marco Russo (marcorusso97)
approved these changes
May 22, 2026
Summary
Fixes #347 — TAP attack score scale inconsistency causing premature early stopping and incorrect pruning.
Problem
Two issues in the TAP attack's scoring logic:
- Score normalization: Binary judges (HarmBench) return 0/1, but TAP's internal logic expects 1-10. With success_score_threshold defaulting to 1 in DEFAULT_TAP_CONFIG, any non-refusal response (score=1) immediately triggers early stopping, making the tree search pointless.
- finalize_all_goals threshold: The evaluation step's finalize_all_goals call didn't pass through the configured success_score_threshold, causing it to use a different default than the generation step.
Changes
- generation.py: Add _normalize_score() to map binary (0/1) scores to the 1-10 scale used by TAP's internal system prompt. All judge scores are normalized before being used for pruning, early stopping, and candidate selection.
- evaluation.py: Pass success_threshold to coordinator.finalize_all_goals() so the final success determination uses the same threshold as generation.
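As a rough illustration of how normalized scores might feed pruning and early stopping, the sketch below applies the two defaults quoted in this PR (min_judge_prune_score of 3, success_score_threshold of 10). The config dict, the prune_and_check helper, and the (prompt, score) tuple shape are assumptions for the example, not the repository's code.

```python
from typing import Dict, List, Tuple

# Default values come from the PR text; the dict itself is an illustrative stand-in.
ILLUSTRATIVE_TAP_CONFIG: Dict[str, int] = {
    "success_score_threshold": 10,
    "min_judge_prune_score": 3,
}

ScoredBranch = Tuple[str, int]  # (candidate prompt, normalized 1-10 score)


def prune_and_check(
    scored: List[ScoredBranch], config: Dict[str, int]
) -> Tuple[List[ScoredBranch], bool]:
    """Drop branches below the prune floor; report whether any reached the threshold."""
    floor = config.get("min_judge_prune_score", 3)
    threshold = config.get("success_score_threshold", 10)
    # Keep only branches at or above the floor, best score first.
    kept = sorted(
        (branch for branch in scored if branch[1] >= floor),
        key=lambda branch: branch[1],
        reverse=True,
    )
    stop_early = any(score >= threshold for _, score in kept)
    return kept, stop_early


branches = [("prompt-a", 1), ("prompt-b", 4), ("prompt-c", 7)]
kept, stop_early = prune_and_check(branches, ILLUSTRATIVE_TAP_CONFIG)
print(kept)        # [('prompt-c', 7), ('prompt-b', 4)]
print(stop_early)  # False
```

With a floor of 1, every branch in this example would survive. With a floor of 3, the score-1 branch is dropped, and the search continues because no branch has reached 10.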