fix: propagate adapter/execution errors to dashboard instead of masking as failed attacks #346
Merged
…iled attacks
Adapter-level errors (timeouts, connection failures) were silently lost
at multiple points in the pipeline, causing error rows to appear as
regular failed attacks with score=0 in the dashboard.
Orchestrator — normalize dict-style results (baseline):
- Add _normalize_attack_results() to extract the 'evaluated' list from
baseline's {"evaluated": [...], "summary": [...]} dict before passing
to the evaluation pipeline. Iterating a dict yielded its keys as
strings, producing 'Evaluation row N is str, coercing to dict'
warnings and phantom results with completion='evaluated'/'summary'.
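The dict-iteration pitfall described above can be reproduced in a few lines. This is a minimal sketch: the function name `normalize_attack_results` and the row contents are illustrative assumptions, not the actual `_normalize_attack_results()` implementation.

```python
from typing import Any, Dict, List, Union


def normalize_attack_results(
    results: Union[Dict[str, Any], List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for _normalize_attack_results()."""
    if isinstance(results, dict):
        # Baseline returns {"evaluated": [...], "summary": [...]}; only the
        # "evaluated" list holds per-attack rows.
        return list(results.get("evaluated") or [])
    return list(results)


baseline_output = {
    "evaluated": [{"goal": "g1", "completion": "c1"}],
    "summary": [{"total": 1}],
}

# Iterating the dict directly yields its keys, not the result rows.
naive_rows = list(baseline_output)
normalized_rows = normalize_attack_results(baseline_output)
```

Here `naive_rows` contains the two key strings, which is what produced the phantom results, while `normalized_rows` contains only the real row dicts.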
Baseline generation — detect adapter errors from router:
- Check response.get('error_message') on the router dict response.
Previously only Python exceptions were caught; silent adapter errors
(e.g. Ollama timeout dicts) produced empty completions with no error
marker.
BaseEvaluationStep — error-row detection and exclusion:
- Add _detect_error_indices() / _mark_error_rows() to stamp is_error=True
on rows carrying error/error_message with no usable completion.
- Pass error_indices to _enrich_items_with_scores() so error rows get
score=0, success=False without being computed as normal results.
- Exclude error rows from ASR denominator in _log_evaluation_asr().
- Record error traces in _update_tracker() instead of judge evaluations.
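A rough sketch of the detection rule and its effect on ASR follows. The helper names mirror the ones listed above without the leading underscore, but the bodies and the row schema (`completion`, `success`, `score`) are assumptions made for illustration.

```python
from typing import Any, Dict, List, Set


def detect_error_indices(rows: List[Dict[str, Any]]) -> Set[int]:
    """Return indices of rows that carry an error and no usable completion."""
    indices = set()
    for i, row in enumerate(rows):
        has_error = bool(row.get("error") or row.get("error_message"))
        has_completion = bool((row.get("completion") or "").strip())
        if has_error and not has_completion:
            indices.add(i)
    return indices


def mark_error_rows(rows: List[Dict[str, Any]], error_indices: Set[int]) -> None:
    """Stamp error rows in place so later stages can recognise them."""
    for i in error_indices:
        rows[i].update(is_error=True, score=0, success=False)


def attack_success_rate(rows: List[Dict[str, Any]], error_indices: Set[int]) -> float:
    """ASR over rows that were actually evaluated (error rows excluded)."""
    scored = [row for i, row in enumerate(rows) if i not in error_indices]
    if not scored:
        return 0.0
    return sum(1 for row in scored if row.get("success")) / len(scored)


rows = [
    {"completion": "Sure, here is the answer", "success": True},
    {"completion": "I cannot help with that.", "success": False},
    {"completion": "", "error": "timeout"},
    {"completion": None, "error_message": "connection refused"},
]
error_indices = detect_error_indices(rows)
mark_error_rows(rows, error_indices)
asr = attack_success_rate(rows, error_indices)
```

With the two error rows removed from the denominator, the example yields an ASR of 0.5 instead of the 0.25 that counting them as failed attacks would give.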
BaseJudgeEvaluator — preserve error context:
- Separate error rows before the trivial/non-trivial split so they get
'execution/adapter error: <msg>' instead of being overwritten with
'filtered out: trivial/placeholder completion'.
Baseline evaluation — error-aware pattern matching and finalization:
- Skip error rows from pattern/keyword evaluation.
- Detect all-error goals and finalize with ERROR_AGENT_RESPONSE instead
of FAILED_JAILBREAK via new evaluation_status param on finalize_goal().
Tracker.finalize_goal — evaluation_status override:
- Accept optional evaluation_status parameter to allow callers to set
ERROR_AGENT_RESPONSE (or any other status) explicitly.
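The override can be pictured as follows. This is only a sketch: the enum is a reduced stand-in for the project's status enum, and the `resolve_status` helper and its `success` parameter are hypothetical, since the real `finalize_goal()` signature is not shown here.

```python
from enum import Enum
from typing import Optional


class EvaluationStatus(Enum):
    """Reduced stand-in containing only the statuses named in this PR."""
    SUCCESSFUL_JAILBREAK = "SUCCESSFUL_JAILBREAK"
    FAILED_JAILBREAK = "FAILED_JAILBREAK"
    ERROR_AGENT_RESPONSE = "ERROR_AGENT_RESPONSE"


def resolve_status(
    success: bool,
    evaluation_status: Optional[EvaluationStatus] = None,
) -> EvaluationStatus:
    """Pick the final status; an explicit override wins over the default."""
    if evaluation_status is not None:
        return evaluation_status
    if success:
        return EvaluationStatus.SUCCESSFUL_JAILBREAK
    return EvaluationStatus.FAILED_JAILBREAK


default_status = resolve_status(success=False)
error_status = resolve_status(
    success=False,
    evaluation_status=EvaluationStatus.ERROR_AGENT_RESPONSE,
)
```

Keeping the parameter optional means existing callers that never pass it keep their current behaviour.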
AdvPrefix — normalize error field name:
- Add 'error' key alongside existing 'error_message' for consistency
with the evaluation pipeline's error detection.
Tests: 17 new tests in test_error_propagation.py covering all layers.
When all BoN candidates for a goal returned adapter-level errors (timeouts,
connection failures), the errors were silently lost and the goal appeared
as FAILED_JAILBREAK with score=0 in the dashboard.
Root cause: _search_single_goal() never copied the error message from
failed step candidates into best_result, so best_result.error stayed None
through evaluation and finalization.
BoN generation — propagate error from failed steps:
- Copy step_best.error into best_result when all candidates in a step
fail and no prior step produced a valid response.
BoN evaluation — mark error rows with is_error flag:
- Set is_error=True on error rows so the coordinator can detect them
(previously only set best_score=0 and success=False).
TrackingCoordinator.finalize_all_goals — detect all-error goals:
- Check for goals where every result has is_error or error-without-response,
and finalize with ERROR_AGENT_RESPONSE instead of FAILED_JAILBREAK.
This applies to all techniques that use the coordinator (BoN,
AdvPrefix, and any future techniques).
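The all-error check performed by the coordinator might look roughly like this. The function name and the `response` key are assumptions; only the `is_error` and `error` fields come from the description above.

```python
from typing import Any, Dict, List


def is_all_error_goal(results: List[Dict[str, Any]]) -> bool:
    """True when every result for a goal is an error row."""
    if not results:
        # A goal with no results is not treated as an error in this sketch.
        return False
    return all(
        bool(r.get("is_error")) or (bool(r.get("error")) and not r.get("response"))
        for r in results
    )


all_failed = [
    {"is_error": True, "error": "timeout"},
    {"error": "connection refused", "response": ""},
]
partly_failed = [
    {"is_error": True, "error": "timeout"},
    {"response": "I cannot help with that.", "success": False},
]
```

A goal such as `all_failed` would be finalized as ERROR_AGENT_RESPONSE, while `partly_failed` still has one genuine target response and keeps its normal status.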
Dashboard label consistency:
- Rename 'Failed' bucket to 'Error' across helpers, storage, and tests
- _result_bucket() returns 'error' instead of 'failed' for error statuses
- _eval_label() returns 'Error' instead of 'Failed' for error bucket
- _eval_color() returns 'warning' (yellow) instead of 'negative' for errors
- Storage count_result_buckets() returns 'error' key instead of 'failed'
Dashboard color consistency:
- Errors use yellow (#eab308 / warning) everywhere instead of orange/red
- Jailbreaks use red (#ef4444 / negative), Mitigated uses green (#22c55e / positive)
- Distribution chart, risk donut, and legend all use unified color scheme
- HTML template evalBadgeClass() uses yellow for error statuses
Results bar — add error count badge:
- Add errors counter to _summarize_run_results() alongside jailbreaks/mitigations
- Propagate errors field to run row dicts at all three call sites
- NiceGUI: add yellow q-badge showing error count in body-cell-results slot
- HTML template already had error badges from prior work
API routes — sync bucket rename:
- _api.py _derive_run_status() and stats route use 'error' bucket name
- Remove FAILED run status — runs with errors show as COMPLETED
Label updates:
- 'Vulnerable' → 'Jailbreaks' in bar chart and legends
- 'Passed / Safe' → 'Mitigated' in HTML template chart labels
- 'Failed' → 'Error' in JS evalLabel() for FAILED_CRITERIA and ERROR statuses
- Runs with errors show as COMPLETED (green) — errors visible in results badges
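The unified scheme can be summarised as one lookup table. The colors, the 'warning'/'negative'/'positive' names, and the 'error' bucket come from the notes above; the 'jailbreak' and 'mitigated' bucket names, the status-to-bucket mapping, and the function bodies are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, Iterable

# bucket -> (label, named color, hex color)
BUCKET_STYLE = {
    "jailbreak": ("Jailbreak", "negative", "#ef4444"),
    "mitigated": ("Mitigated", "positive", "#22c55e"),
    "error": ("Error", "warning", "#eab308"),
}

# Assumed mapping from evaluation status to dashboard bucket.
STATUS_TO_BUCKET = {
    "SUCCESSFUL_JAILBREAK": "jailbreak",
    "FAILED_JAILBREAK": "mitigated",
    "ERROR_AGENT_RESPONSE": "error",
}


def result_bucket(status: str) -> str:
    return STATUS_TO_BUCKET[status]


def eval_label(bucket: str) -> str:
    return BUCKET_STYLE[bucket][0]


def eval_color(bucket: str) -> str:
    return BUCKET_STYLE[bucket][1]


def count_result_buckets(statuses: Iterable[str]) -> Dict[str, int]:
    """Count results per bucket, always reporting all three keys."""
    counts = Counter(result_bucket(s) for s in statuses)
    return {bucket: counts.get(bucket, 0) for bucket in BUCKET_STYLE}


bucket_counts = count_result_buckets(
    ["SUCCESSFUL_JAILBREAK", "FAILED_JAILBREAK", "ERROR_AGENT_RESPONSE", "ERROR_AGENT_RESPONSE"]
)
```

Driving labels, colors, and counts from a single table is one way to keep the chart, donut, legend, and badges from drifting apart again.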
_query_target_simple — surface errors in metadata:
- Store error_message in metadata dict when adapter returns an error
response or when an exception is raised, so callers can distinguish
infrastructure failures from empty responses.
_run_single_goal — track attacker and target errors:
- Count attacker_errors (prompt generation failures) and target_errors
(adapter timeouts, connection failures) across iterations.
- When all iterations failed due to infrastructure errors and no usable
target response was obtained, mark the result with is_error=True, error,
and error_message fields.
run() — finalize with ERROR_AGENT_RESPONSE:
- Detect is_error on results and pass
evaluation_status=EvaluationStatusEnum.ERROR_AGENT_RESPONSE to
Tracker.finalize_goal() so the dashboard shows errors instead of plain
FAILED_JAILBREAK.
PAIREvaluation — error-aware ASR:
- Skip is_error rows when computing evaluated_count and ASR.
- Stamp error rows with descriptive evaluation_notes.
- Log error count alongside success rate.
_create_attacker_router — forward config to adapter:
- Add role='attacker' to metadata for clearer logging/error attribution.
- Forward operational keys (timeout, max_tokens, temperature, top_p, top_k,
num_ctx) from attacker config to adapter metadata so the adapter respects
per-model settings at init time.
_attack_single_goal — track attacker and target errors:
- Count attacker_errors (empty attacker responses) and target_errors
(adapter timeouts, connection failures) across techniques.
- When all techniques failed due to infrastructure errors and no usable
target response was obtained, mark the result with is_error=True, error,
and error_message fields.
PAPEvaluation — error-aware ASR:
- Detect error rows via is_error flag or error+no-response pattern.
- Stamp is_error=True and descriptive evaluation_notes on error rows.
- Exclude error rows from ASR denominator.
- Log error count alongside success rate.
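Forwarding the operational keys could be sketched like this. The key list and the role value come from the notes above; the function name, the shape of the config dict, and the choice to skip unset keys are assumptions.

```python
from typing import Any, Dict

OPERATIONAL_KEYS = ("timeout", "max_tokens", "temperature", "top_p", "top_k", "num_ctx")


def build_attacker_metadata(attacker_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: copy per-model settings into adapter metadata."""
    metadata: Dict[str, Any] = {"role": "attacker"}
    for key in OPERATIONAL_KEYS:
        value = attacker_config.get(key)
        if value is not None:
            # Only forward keys the user actually set, so adapter defaults
            # still apply to everything else.
            metadata[key] = value
    return metadata


attacker_metadata = build_attacker_metadata(
    {"model": "example-model", "timeout": 120, "temperature": 0.7, "top_k": None}
)
```

Non-operational keys such as the model name are left out of the metadata in this sketch, and unset values do not override adapter defaults.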
…igated
Error rows (e.g. timeouts) were silently lost through the evaluation
pipeline and finalized as FAILED_JAILBREAK ("Mitigated") instead of
ERROR_AGENT_RESPONSE.
Root causes fixed:
- completions.py: propagate the normalized 'error' key so
_detect_error_indices can identify error rows downstream
- evaluation.py: detect/mark error rows before judge evaluation; preserve
error rows through NLL filtering, aggregation, and selection so they
reach finalize_all_goals with is_error=True
- sync.py: skip is_error rows in sync_evaluation_to_server so the
coordinator's ERROR_AGENT_RESPONSE is not overwritten by FAILED_JAILBREAK
_query_target — return (content, error_message) tuple:
- Wrap router call in try/except so adapter exceptions surface as
error messages instead of being lost.
- Pull error_message/error from dict responses so callers can
distinguish empty responses from infrastructure failures.
_query_attacker — return (parsed_dict, error_message) tuple:
- Same treatment as the target path so attacker LLM timeouts and
connection failures are visible to run_single_goal.
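The tuple-returning query pattern can be sketched as below. The `route` callable stands in for the router call, and the `generated_text` key for the successful payload is an assumption; only the `error_message`/`error` keys are taken from the description.

```python
from typing import Any, Callable, Optional, Tuple


def query_target(route: Callable[[str], Any], prompt: str) -> Tuple[str, Optional[str]]:
    """Return (content, error_message); error_message is None on success."""
    try:
        response = route(prompt)
    except Exception as exc:  # adapter exceptions become error messages
        return "", f"{type(exc).__name__}: {exc}"
    if isinstance(response, dict):
        error = response.get("error_message") or response.get("error")
        if error:
            return "", str(error)
        return str(response.get("generated_text") or ""), None
    return str(response or ""), None


def raising_route(prompt: str) -> Any:
    raise TimeoutError("timed out")


def error_dict_route(prompt: str) -> Any:
    return {"generated_text": "", "error_message": "connection refused"}


def ok_route(prompt: str) -> Any:
    return {"generated_text": "hello"}


raised = query_target(raising_route, "hi")
silent = query_target(error_dict_route, "hi")
ok = query_target(ok_route, "hi")
```

Both failure modes end up in the second tuple slot, so a caller can tell an infrastructure failure apart from a target that genuinely returned nothing.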
_expand_one_branch — propagate attacker errors:
- Always return a dict (no more None) carrying the last attacker
error so run_single_goal can count attacker_errors per branch.
run_single_goal — track attacker and target errors:
- Count attacker_errors (failed branch expansions) and target_errors
(adapter timeouts, connection failures) across depth iterations.
- When best_response is empty AND any infra error occurred, mark the
result with is_error=True, error, and error_message fields so the
TrackingCoordinator finalizes the goal as ERROR_AGENT_RESPONSE
instead of FAILED_JAILBREAK ("Mitigated").
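The marking rule for a goal's final result could look roughly like this. The function name, parameters, and result keys other than `is_error`, `error`, and `error_message` are assumptions for the sketch.

```python
from typing import Any, Dict, Optional


def build_goal_result(
    goal: str,
    best_response: str,
    attacker_errors: int,
    target_errors: int,
    last_error: Optional[str],
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the per-goal result dict."""
    result: Dict[str, Any] = {
        "goal": goal,
        "response": best_response,
        "attacker_errors": attacker_errors,
        "target_errors": target_errors,
    }
    # Only flag the goal when nothing usable came back AND at least one
    # infrastructure error was observed along the way.
    if not best_response and (attacker_errors or target_errors):
        message = last_error or "unknown"
        result.update(is_error=True, error=message, error_message=message)
    return result


errored = build_goal_result("g1", "", attacker_errors=0, target_errors=3, last_error="timeout")
resisted = build_goal_result("g2", "I cannot help.", attacker_errors=0, target_errors=1, last_error="timeout")
```

A goal that produced a real target response, like `resisted`, is not flagged even if some queries failed, so it continues through normal evaluation.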
TapEvaluation.execute — error-aware finalization:
- Detect is_error rows up front, stamp evaluation_notes with the
underlying error, skip judge re-scoring, and emit a dedicated
tracker evaluation trace so the dashboard reflects the failure.
Errors from the target adapter (timeouts, connection failures) caused
empty target responses, which the Scorer LLM could then hallucinate a
high score for — making infrastructure failures look like successful
jailbreaks in the dashboard.
core.query_target — return (response, error_message) tuple:
- Wrap router call in try/except so adapter exceptions surface as
error messages instead of being lost.
- Pull error_message/error from dict responses so callers can
distinguish empty responses from infrastructure failures.
warm_up / lifelong — skip scoring on infrastructure errors:
- Detect (target_resp, target_error) tuple from query_target and
short-circuit score_response with score=0 and an error assessment
whenever an empty response was caused by an adapter error.
- Track target_error_count and last_target_error across iterations
in lifelong; when no jailbreak was scored AND any infra error
occurred, mark the per-goal result with is_error=True, error,
and error_message fields.
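The short-circuit can be illustrated with a small wrapper. The wrapper name and the `(score, assessment)` return shape are assumptions; the assessment prefix reuses the 'execution/adapter error' wording from the judge evaluator notes earlier in this PR.

```python
from typing import Callable, List, Optional, Tuple

Scorer = Callable[[str], Tuple[float, str]]


def score_or_short_circuit(
    target_resp: str,
    target_error: Optional[str],
    score_response: Scorer,
) -> Tuple[float, str]:
    """Skip the scorer when an empty response was caused by an adapter error."""
    if target_error and not target_resp:
        return 0.0, f"execution/adapter error: {target_error}"
    return score_response(target_resp)


scorer_calls: List[str] = []


def fake_scorer(response: str) -> Tuple[float, str]:
    # Stand-in for the Scorer LLM; records every call it receives.
    scorer_calls.append(response)
    return 8.5, "looks harmful"


skipped = score_or_short_circuit("", "timeout", fake_scorer)
scored = score_or_short_circuit("some target reply", None, fake_scorer)
```

Because the scorer never sees the empty response, it has no opportunity to hallucinate a high score for an infrastructure failure.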
AutoDANTurboEvaluation — error-aware finalization:
- Detect is_error rows up front, force score=0, set scorer_verdict
to 'error', stamp evaluation_notes/evaluation_summary with the
underlying error, and skip the harmful/safe verdict so the
TrackingCoordinator finalizes the goal as ERROR_AGENT_RESPONSE
instead of FAILED_JAILBREAK ("Mitigated") or — worse —
SUCCESSFUL_JAILBREAK from a hallucinated scorer.
tests:
- Update test_query_target to expect the new tuple return.
- Update test_execute_with_only_errors_returns_failure to assert
the new is_error/scorer_verdict='error' contract.
)
if is_error_row:
    err_msg = item.get("error") or item.get("error_message") or "unknown"
    auto_score = 0.0

    for c in mock_backend.update_result.call_args_list
    if "evaluation_status" in (c.kwargs or {})
]
self.assertTrue(len(finalize_calls) >= 1)
Problem
Adapter-level errors (timeouts, connection failures, empty responses from infrastructure) were silently lost at multiple points in the attack pipeline. Instead of surfacing as Error in the dashboard, they appeared as regular Failed / Mitigated results with score=0, making infrastructure problems look like the target model simply resisted the attack.

Root Causes & Fixes

1. Baseline, AdvPrefix — error detection and propagation (227b557)
- _normalize_attack_results() extracts the evaluated list from Baseline's {"evaluated": [...], "summary": [...]} dict. Iterating the dict directly yielded keys as strings, producing phantom results.
- Baseline generation: detect adapter errors reported in the router's dict response (error_message field), not only Python exceptions.
- BaseEvaluationStep: add _detect_error_indices() / _mark_error_rows() to stamp is_error=True on rows that carry an error with no usable completion. Exclude error rows from ASR denominator and tracker updates.
- BaseJudgeEvaluator: separate error rows before the trivial/non-trivial split so they don't get overwritten with "filtered: trivial completion".
- Baseline evaluation: skip error rows in pattern/keyword evaluation and finalize all-error goals with ERROR_AGENT_RESPONSE.
- Tracker.finalize_goal: accept an optional evaluation_status parameter so callers can explicitly set ERROR_AGENT_RESPONSE.
- Tests: 17 new tests in test_error_propagation.py.

2. BoN — error propagation from candidates to best result (1b81d1e)
- _search_single_goal() now copies the error message from failed step candidates into best_result when no step produced a valid response.
- BoN evaluation: set is_error=True on error rows (previously only set best_score=0).
- TrackingCoordinator.finalize_all_goals: detect all-error goals and finalize with ERROR_AGENT_RESPONSE instead of FAILED_JAILBREAK.

3. Dashboard — unified labels, colors, and error badges (d146abe)
- Rename failed bucket to error across helpers, storage, and tests.
- Errors use yellow (warning) everywhere; jailbreaks red, mitigated green.
- Label updates: Vulnerable → Jailbreaks, Passed/Safe → Mitigated, Failed → Error.
- Runs with errors show as COMPLETED; errors are visible in the per-result badges.

4. PAIR — surface errors in metadata and track infra failures (aa3a252)
- _query_target_simple: store error_message in metadata when the adapter errors.
- _run_single_goal: count attacker_errors and target_errors across iterations; mark result is_error=True when all iterations failed due to infrastructure errors.
- PAIREvaluation: skip error rows from ASR, stamp descriptive evaluation_notes.

5. PAP — track errors across techniques (78c3e37)
- _attack_single_goal: count attacker_errors and target_errors across techniques; mark result is_error=True when all techniques failed.
- PAPEvaluation: detect and exclude error rows from ASR.

6. AdvPrefix — preserve error rows through the full pipeline (66d5226)
- completions.py: propagate the normalized error key so _detect_error_indices can identify error rows.
- evaluation.py: detect/mark error rows before judge evaluation; preserve them through NLL filtering, aggregation, and selection so they reach finalize_all_goals with is_error=True.
- sync.py: skip is_error rows in sync_evaluation_to_server so the coordinator's ERROR_AGENT_RESPONSE is not overwritten by FAILED_JAILBREAK.

7. TAP — don't count infra errors as scoring failures (a20b768)
- _query_target: return (response, error_message) tuple; surface adapter exceptions as error messages.
- … target_error is set and response is empty; track target_error_count; mark result is_error=True when all queries failed.

8. AutoDAN-Turbo — skip scorer on infra errors, propagate to results (a48daca)
- core.query_target: wrap router call in try/except; pull error_message from dict responses.
- warm_up / lifelong: detect (target_resp, target_error) tuple; skip score_response on empty error responses; track target_error_count; mark per-goal result is_error=True.
- AutoDANTurboEvaluation: detect is_error rows, force score=0, set scorer_verdict='error', stamp evaluation_notes.