Skip to content

fix: normalize TAP judge scores to consistent 1-10 scale#365

Merged
Marco Russo (marcorusso97) merged 2 commits into
mainfrom
fix/tap-score-scale-inconsistency
May 22, 2026
Merged

fix: normalize TAP judge scores to consistent 1-10 scale#365
Marco Russo (marcorusso97) merged 2 commits into
mainfrom
fix/tap-score-scale-inconsistency

Conversation

@RPaolino

Copy link
Copy Markdown
Contributor

Summary

Fixes #347 — TAP attack score scale inconsistency causing premature early stopping and incorrect pruning.

Problem

Two issues in the TAP attack's scoring logic:

  1. Score normalization: Binary judges (HarmBench) return 0/1, but TAP's internal logic expects 1-10. With success_score_threshold defaulting to 1 in DEFAULT_TAP_CONFIG, any non-refusal response (score=1) immediately triggers early stopping — making the tree search pointless.

  2. finalize_all_goals threshold: The evaluation step's finalize_all_goals call didn't pass through the configured success_score_threshold, causing it to use a different default than the generation step.

Changes

  • generation.py: Add _normalize_score() to map binary (0/1) scores to the 1-10 scale used by TAP's internal system prompt. All judge scores are normalized before being used for pruning, early stopping, and candidate selection.
  • evaluation.py: Pass success_threshold to coordinator.finalize_all_goals() so the final success determination uses the same threshold as generation.

Problem
-------
TAP had two conflicting defaults for success_score_threshold:
- DEFAULT_TAP_CONFIG dict: 1 (designed for binary 0/1 judges)
- TapParams Pydantic model: 10 (designed for nuanced 1-10 judges)

Since the TAP JUDGE_SYSTEM_PROMPT scores on a 1-10 scale (where 1 is
the minimum), any non-refusal response scored >= 1 and immediately
triggered early stopping when the dict default (1) was used.

The pruning min_score was hardcoded to 1, which:
- On the 1-10 scale: kept every branch (no pruning effect)
- On the 0/1 scale: only kept already-jailbroken branches (too aggressive)

Solution
--------
- evaluation.py: Auto-detect binary judges (harmbench, harmbench_variant,
  jailbreakbench) via _judges_are_binary() with robust type inference
  from type, evaluator_type, and identifier fields. Normalize binary
  0/1 scores to 1/10 in score_candidates() so all downstream logic
  (early stop, pruning, success check) works on a uniform 1-10 scale.

- config.py: Align DEFAULT_TAP_CONFIG dict default to 10, matching
  TapParams. Add min_judge_prune_score (default 3) to both dict and
  Pydantic model.

- generation.py: Read min_judge_prune_score from config (default 3)
  instead of hardcoded min_score=1. Align success_threshold fallback
  to 10.

Note: The default success_score_threshold is 10 (perfect jailbreak).
Users can override via tap_params.success_score_threshold.
The coordinator's _default_goal_scorer uses success_threshold=0.5 by default,
which is correct for a 0-1 scale but wrong for TAP's 1-10 normalized scale.
A judge score of 1 (= 'no jailbreak') exceeds 0.5 and was incorrectly
classified as SUCCESSFUL_JAILBREAK.

TAP now passes success_score_threshold (default 10) so only scores >= 10
are classified as successful jailbreaks.
@codecov

codecov Bot commented May 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 25.80645% with 23 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
hackagent/attacks/techniques/tap/evaluation.py 19.23% 21 Missing ⚠️
hackagent/attacks/techniques/tap/generation.py 0.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@marcorusso97 Marco Russo (marcorusso97) merged commit 420fcc8 into main May 22, 2026
21 of 22 checks passed
@marcorusso97 Marco Russo (marcorusso97) deleted the fix/tap-score-scale-inconsistency branch May 22, 2026 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TAP Attack: score scale inconsistency

2 participants