fix: propagate adapter/execution errors to dashboard instead of masking as failed attacks by RPaolino · Pull Request #346 · AISecurityLab/hackagent

Raffaele Paolino (RPaolino) · 2026-04-29T12:13:52Z

Problem

Adapter-level errors (timeouts, connection failures, empty responses from infrastructure) were silently lost at multiple points in the attack pipeline. Instead of surfacing as Error in the dashboard, they appeared as regular Failed / Mitigated results with score=0 — making infrastructure problems look like the target model simply resisted the attack.

Root Causes & Fixes

1. Baseline, AdvPrefix — error detection and propagation (`227b557`)

Orchestrator: _normalize_attack_results() extracts the evaluated list from Baseline's {"evaluated": [...], "summary": [...]} dict. Iterating the dict directly yielded keys as strings, producing phantom results.
Baseline generation: detect adapter errors from the router response dict (error_message field), not only Python exceptions.
BaseEvaluationStep: add _detect_error_indices() / _mark_error_rows() to stamp is_error=True on rows that carry an error with no usable completion. Exclude error rows from ASR denominator and tracker updates.
BaseJudgeEvaluator: separate error rows before the trivial/non-trivial split so they don't get overwritten with "filtered: trivial completion".
Baseline evaluation: skip error rows from pattern/keyword evaluation; finalize all-error goals with ERROR_AGENT_RESPONSE.
Tracker.finalize_goal: accept an optional evaluation_status parameter so callers can explicitly set ERROR_AGENT_RESPONSE.
17 new tests in test_error_propagation.py.
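
The three ideas in this commit (unwrap the Baseline dict, detect error rows, keep them out of the ASR denominator) can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names drop the leading underscore, and the row shape (`completion`, `success`, `error`, `error_message` keys) is an assumption based on the field names mentioned above.

```python
from __future__ import annotations

from typing import Any


def normalize_attack_results(results: Any) -> list[dict]:
    """Return the list of evaluated rows, whatever shape the technique returned."""
    if isinstance(results, dict):
        # Iterating a dict yields its keys ("evaluated", "summary"), not rows,
        # so the list has to be pulled out explicitly.
        return list(results.get("evaluated", []))
    return list(results or [])


def detect_error_indices(rows: list[dict]) -> set[int]:
    """Indices of rows that carry an error and have no usable completion."""
    indices = set()
    for i, row in enumerate(rows):
        has_error = bool(row.get("error") or row.get("error_message"))
        has_completion = bool((row.get("completion") or "").strip())
        if has_error and not has_completion:
            indices.add(i)
    return indices


def attack_success_rate(rows: list[dict]) -> float:
    """ASR over rows that were actually evaluated; error rows are excluded."""
    errors = detect_error_indices(rows)
    evaluated = [r for i, r in enumerate(rows) if i not in errors]
    if not evaluated:
        return 0.0
    return sum(1 for r in evaluated if r.get("success")) / len(evaluated)


baseline_output = {
    "evaluated": [
        {"completion": "Sure, here is...", "success": True},
        {"completion": "I can't help with that.", "success": False},
        {"completion": "", "error_message": "timeout", "success": False},
    ],
    "summary": [],
}
rows = normalize_attack_results(baseline_output)
print(detect_error_indices(rows))  # {2}
print(attack_success_rate(rows))   # 0.5
```

With the timeout row counted in the denominator the rate would have been 1/3, which is the kind of distortion the fix removes.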

2. BoN — error propagation from candidates to best result (`1b81d1e`)

BoN generation: _search_single_goal() now copies the error message from failed step candidates into best_result when no step produced a valid response.
BoN evaluation: set is_error=True on error rows (previously only set best_score=0).
TrackingCoordinator.finalize_all_goals: detect all-error goals and finalize with ERROR_AGENT_RESPONSE instead of FAILED_JAILBREAK.
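
A rough sketch of the two behaviours described here, carrying an error into the best result and finalizing all-error goals with a distinct status. The enum below is a local stand-in for the project's status enum, `pick_best` and `final_status` are hypothetical helper names, and the candidate keys (`response`, `score`, `error`) are assumed. For brevity the sketch sets `is_error` at selection time, whereas the PR sets it during evaluation.

```python
from __future__ import annotations

from enum import Enum


class EvaluationStatus(Enum):
    """Local stand-in for the project's evaluation status enum."""
    SUCCESSFUL_JAILBREAK = "SUCCESSFUL_JAILBREAK"
    FAILED_JAILBREAK = "FAILED_JAILBREAK"
    ERROR_AGENT_RESPONSE = "ERROR_AGENT_RESPONSE"


def pick_best(candidates: list[dict]) -> dict:
    """Best-scoring valid candidate, or an error result if none is valid."""
    valid = [c for c in candidates if c.get("response")]
    if valid:
        return max(valid, key=lambda c: c.get("score", 0.0))
    # No candidate produced a response: keep the most recent error message
    # instead of returning a bare score=0 result.
    last_error = next(
        (c["error"] for c in reversed(candidates) if c.get("error")), None
    )
    return {
        "response": None,
        "score": 0.0,
        "error": last_error,
        "is_error": last_error is not None,
    }


def final_status(results: list[dict]) -> EvaluationStatus:
    """Status for one goal, given all of its result rows."""
    if results and all(r.get("is_error") for r in results):
        return EvaluationStatus.ERROR_AGENT_RESPONSE
    if any(r.get("success") for r in results):
        return EvaluationStatus.SUCCESSFUL_JAILBREAK
    return EvaluationStatus.FAILED_JAILBREAK


best = pick_best([
    {"response": None, "error": "timeout"},
    {"response": None, "error": "connection refused"},
])
print(best["error"])                # connection refused
print(final_status([best]).value)   # ERROR_AGENT_RESPONSE
```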

3. Dashboard — unified labels, colors, and error badges (`d146abe`)

Rename failed bucket to error across helpers, storage, and tests.
Errors use yellow (warning) everywhere; jailbreaks red, mitigated green.
Add error count badge to the results bar alongside jailbreaks/mitigations.
Rename chart labels: Vulnerable → Jailbreaks, Passed/Safe → Mitigated, Failed → Error.
Runs with errors show as COMPLETED — errors are visible in the per-result badges.
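
The unified scheme amounts to a single mapping from status to bucket, label, and color. The sketch below is illustrative only: the `error` bucket key, the labels, and the hex colors come from this PR's commit messages, while the `jailbreak` and `mitigated` key names, the `STATUS_TO_BUCKET` table, and the decision to treat any unlisted status as an error are assumptions.

```python
from __future__ import annotations

from collections import Counter

# bucket -> (label, semantic color name, hex color)
BUCKET_STYLE = {
    "jailbreak": ("Jailbreaks", "negative", "#ef4444"),
    "mitigated": ("Mitigated", "positive", "#22c55e"),
    "error": ("Error", "warning", "#eab308"),
}

# Hypothetical status table; only the statuses named in this PR are listed.
STATUS_TO_BUCKET = {
    "SUCCESSFUL_JAILBREAK": "jailbreak",
    "FAILED_JAILBREAK": "mitigated",
    "ERROR_AGENT_RESPONSE": "error",
    "FAILED_CRITERIA": "error",
}


def result_bucket(status: str) -> str:
    """Bucket for a status; unlisted statuses fall back to 'error' in this sketch."""
    return STATUS_TO_BUCKET.get(status, "error")


def count_result_buckets(statuses: list[str]) -> dict[str, int]:
    """Per-bucket counts, always including all three buckets."""
    counts = Counter(result_bucket(s) for s in statuses)
    return {bucket: counts.get(bucket, 0) for bucket in BUCKET_STYLE}


counts = count_result_buckets([
    "SUCCESSFUL_JAILBREAK",
    "FAILED_JAILBREAK",
    "FAILED_JAILBREAK",
    "ERROR_AGENT_RESPONSE",
])
for bucket, n in counts.items():
    label, _, hex_color = BUCKET_STYLE[bucket]
    print(f"{label}: {n} ({hex_color})")
```

Keeping labels and colors in one table is what lets the chart, the donut, and the badges stay consistent with each other.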

4. PAIR — surface errors in metadata and track infra failures (`aa3a252`)

_query_target_simple: store error_message in metadata when the adapter errors.
_run_single_goal: count attacker_errors and target_errors across iterations; mark result is_error=True when all iterations failed due to infrastructure errors.
PAIREvaluation: skip error rows from ASR, stamp descriptive evaluation_notes.
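
The counting logic can be sketched as a loop that tallies attacker and target failures and flags the goal only when every iteration failed for infrastructure reasons. The signature below is hypothetical: `generate_prompt` and `query_target` are injected callables standing in for the attacker and target paths, and the real loop also scores responses, which is omitted here.

```python
from __future__ import annotations

from typing import Callable, Optional


def run_single_goal(
    n_iterations: int,
    generate_prompt: Callable[[int], tuple[str, Optional[str]]],
    query_target: Callable[[str], tuple[str, dict]],
) -> dict:
    """Run the iterations for one goal and record infrastructure failures."""
    attacker_errors = 0
    target_errors = 0
    last_error = None
    best_response = ""

    for i in range(n_iterations):
        prompt, attacker_error = generate_prompt(i)
        if attacker_error or not prompt:
            attacker_errors += 1
            last_error = attacker_error or "empty attacker prompt"
            continue

        response, metadata = query_target(prompt)
        target_error = metadata.get("error_message")
        if target_error and not response:
            target_errors += 1
            last_error = target_error
            continue

        if response:
            best_response = response  # the real loop would score and compare

    result = {
        "response": best_response,
        "attacker_errors": attacker_errors,
        "target_errors": target_errors,
    }
    all_failed = n_iterations > 0 and attacker_errors + target_errors == n_iterations
    if all_failed and not best_response:
        result.update(is_error=True, error=last_error, error_message=last_error)
    return result


def make_prompt(i: int) -> tuple[str, Optional[str]]:
    return f"prompt {i}", None


def timing_out_target(prompt: str) -> tuple[str, dict]:
    return "", {"error_message": "timeout"}


outcome = run_single_goal(3, make_prompt, timing_out_target)
print(outcome["target_errors"], outcome.get("is_error"))  # 3 True
```

A target that answers with a refusal yields a non-empty response, so the same loop leaves `is_error` unset and the goal is still counted as mitigated.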

5. PAP — track errors across techniques (`78c3e37`)

_attack_single_goal: count attacker_errors and target_errors across techniques; mark result is_error=True when all techniques failed.
PAPEvaluation: detect and exclude error rows from ASR.

6. AdvPrefix — preserve error rows through the full pipeline (`66d5226`)

completions.py: propagate the normalized error key so _detect_error_indices can identify error rows.
evaluation.py: detect/mark error rows before judge evaluation; preserve them through NLL filtering, aggregation, and selection so they reach finalize_all_goals with is_error=True.
sync.py: skip is_error rows in sync_evaluation_to_server so the coordinator's ERROR_AGENT_RESPONSE is not overwritten by FAILED_JAILBREAK.
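
The sync fix is easiest to see with a toy status store. In this sketch the `statuses` dict stands in for the server-side state and `goal_id` is an assumed row key; the real `sync_evaluation_to_server` talks to a backend and is not reproduced here.

```python
from __future__ import annotations


def sync_evaluation(rows: list[dict], statuses: dict[str, str]) -> dict[str, str]:
    """Write per-row verdicts into the status store, leaving error rows alone."""
    for row in rows:
        if row.get("is_error"):
            # The coordinator already finalized this goal as an error;
            # writing a verdict here would overwrite that status.
            continue
        statuses[row["goal_id"]] = (
            "SUCCESSFUL_JAILBREAK" if row.get("success") else "FAILED_JAILBREAK"
        )
    return statuses


statuses = {"goal-2": "ERROR_AGENT_RESPONSE"}
rows = [
    {"goal_id": "goal-1", "success": False},
    {"goal_id": "goal-2", "success": False, "is_error": True},
]
print(sync_evaluation(rows, statuses))
```

Without the `is_error` check, `goal-2` would end up as `FAILED_JAILBREAK`, which is the overwrite this commit prevents.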

7. TAP — don't count infra errors as scoring failures (`a20b768`)

_query_target: return (response, error_message) tuple; surface adapter exceptions as error messages.
Consumption sites: skip scoring when target_error is set and response is empty; track target_error_count; mark result is_error=True when all queries failed.
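
The tuple-returning query can be sketched as a thin wrapper around the router call. The `route` callable and the `content` key are placeholders for whatever the adapter actually returns; only the `error_message` / `error` keys are taken from the description above.

```python
from __future__ import annotations

from typing import Any, Callable, Optional


def query_target(
    route: Callable[[str], Any], prompt: str
) -> tuple[str, Optional[str]]:
    """Return (response_text, error_message); error_message is None on success."""
    try:
        raw = route(prompt)
    except Exception as exc:
        # Adapter exceptions become an error message instead of being lost.
        return "", f"{type(exc).__name__}: {exc}"

    if isinstance(raw, dict):
        error = raw.get("error_message") or raw.get("error")
        content = raw.get("content") or ""  # "content" is an assumed key
        return str(content), (str(error) if error else None)

    return (str(raw) if raw else ""), None


def failing_route(prompt: str) -> Any:
    raise TimeoutError("timed out")


print(query_target(failing_route, "hi"))
# ('', 'TimeoutError: timed out')
print(query_target(lambda p: {"error_message": "connection refused"}, "hi"))
# ('', 'connection refused')
```

A caller can then skip scoring whenever the second element is set and the first is empty, and count that query as an infrastructure failure rather than a refusal.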

8. AutoDAN-Turbo — skip scorer on infra errors, propagate to results (`a48daca`)

core.query_target: wrap router call in try/except; pull error_message from dict responses.
warm_up / lifelong: detect (target_resp, target_error) tuple; skip score_response on empty error responses; track target_error_count; mark per-goal result is_error=True.
AutoDANTurboEvaluation: detect is_error rows, force score=0, set scorer_verdict='error', stamp evaluation_notes.
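
The short-circuit can be sketched as a guard in front of the scorer call, so an empty response caused by an adapter error never reaches the scorer LLM. The function name and the injected `scorer` callable are illustrative; the `scorer_verdict='error'` and `score=0` values follow the description above, and the sketch merges the generation-time skip and the evaluation-time stamping into one step.

```python
from __future__ import annotations

from typing import Callable, Optional


def score_or_skip(
    target_resp: str,
    target_error: Optional[str],
    scorer: Callable[[str], float],
) -> dict:
    """Score a response, unless it is empty because of an infrastructure error."""
    if target_error and not target_resp.strip():
        # Never hand an empty error response to the scorer: it may
        # hallucinate a high score for text that does not exist.
        return {
            "score": 0.0,
            "is_error": True,
            "scorer_verdict": "error",
            "evaluation_notes": f"execution/adapter error: {target_error}",
        }
    return {"score": float(scorer(target_resp)), "is_error": False}


scorer_calls: list[str] = []


def fake_scorer(text: str) -> float:
    """Stand-in scorer that records what it was asked to score."""
    scorer_calls.append(text)
    return 8.5


skipped = score_or_skip("", "timeout", fake_scorer)
print(skipped["score"], skipped["scorer_verdict"], len(scorer_calls))  # 0.0 error 0
```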

…iled attacks

Adapter-level errors (timeouts, connection failures) were silently lost at multiple points in the pipeline, causing error rows to appear as regular failed attacks with score=0 in the dashboard.

Orchestrator — normalize dict-style results (baseline):
- Add _normalize_attack_results() to extract the 'evaluated' list from baseline's {"evaluated": [...], "summary": [...]} dict before passing to the evaluation pipeline. Iterating a dict yielded its keys as strings, producing 'Evaluation row N is str, coercing to dict' warnings and phantom results with completion='evaluated'/'summary'.

Baseline generation — detect adapter errors from router:
- Check response.get('error_message') on the router dict response. Previously only Python exceptions were caught; silent adapter errors (e.g. Ollama timeout dicts) produced empty completions with no error marker.

BaseEvaluationStep — error-row detection and exclusion:
- Add _detect_error_indices() / _mark_error_rows() to stamp is_error=True on rows carrying error/error_message with no usable completion.
- Pass error_indices to _enrich_items_with_scores() so error rows get score=0, success=False without being computed as normal results.
- Exclude error rows from ASR denominator in _log_evaluation_asr().
- Record error traces in _update_tracker() instead of judge evaluations.

BaseJudgeEvaluator — preserve error context:
- Separate error rows before the trivial/non-trivial split so they get 'execution/adapter error: <msg>' instead of being overwritten with 'filtered out: trivial/placeholder completion'.

Baseline evaluation — error-aware pattern matching and finalization:
- Skip error rows from pattern/keyword evaluation.
- Detect all-error goals and finalize with ERROR_AGENT_RESPONSE instead of FAILED_JAILBREAK via new evaluation_status param on finalize_goal().

Tracker.finalize_goal — evaluation_status override:
- Accept optional evaluation_status parameter to allow callers to set ERROR_AGENT_RESPONSE (or any other status) explicitly.

AdvPrefix — normalize error field name:
- Add 'error' key alongside existing 'error_message' for consistency with the evaluation pipeline's error detection.

Tests: 17 new tests in test_error_propagation.py covering all layers.

When all BoN candidates for a goal returned adapter-level errors (timeouts, connection failures), the errors were silently lost and the goal appeared as FAILED_JAILBREAK with score=0 in the dashboard.

Root cause: _search_single_goal() never copied the error message from failed step candidates into best_result, so best_result.error stayed None through evaluation and finalization.

BoN generation — propagate error from failed steps:
- Copy step_best.error into best_result when all candidates in a step fail and no prior step produced a valid response.

BoN evaluation — mark error rows with is_error flag:
- Set is_error=True on error rows so the coordinator can detect them (previously only set best_score=0 and success=False).

TrackingCoordinator.finalize_all_goals — detect all-error goals:
- Check for goals where every result has is_error or error-without-response, and finalize with ERROR_AGENT_RESPONSE instead of FAILED_JAILBREAK. This applies to all techniques that use the coordinator (BoN, AdvPrefix, and any future techniques).

Dashboard label consistency:
- Rename 'Failed' bucket to 'Error' across helpers, storage, and tests
- _result_bucket() returns 'error' instead of 'failed' for error statuses
- _eval_label() returns 'Error' instead of 'Failed' for error bucket
- _eval_color() returns 'warning' (yellow) instead of 'negative' for errors
- Storage count_result_buckets() returns 'error' key instead of 'failed'

Dashboard color consistency:
- Errors use yellow (#eab308 / warning) everywhere instead of orange/red
- Jailbreaks use red (#ef4444 / negative), Mitigated uses green (#22c55e / positive)
- Distribution chart, risk donut, and legend all use unified color scheme
- HTML template evalBadgeClass() uses yellow for error statuses

Results bar — add error count badge:
- Add errors counter to _summarize_run_results() alongside jailbreaks/mitigations
- Propagate errors field to run row dicts at all three call sites
- NiceGUI: add yellow q-badge showing error count in body-cell-results slot
- HTML template already had error badges from prior work

API routes — sync bucket rename:
- _api.py _derive_run_status() and stats route use 'error' bucket name
- Remove FAILED run status — runs with errors show as COMPLETED

Label updates:
- 'Vulnerable' → 'Jailbreaks' in bar chart and legends
- 'Passed / Safe' → 'Mitigated' in HTML template chart labels
- 'Failed' → 'Error' in JS evalLabel() for FAILED_CRITERIA and ERROR statuses
- Runs with errors show as COMPLETED (green) — errors visible in results badges

_query_target_simple — surface errors in metadata:
- Store error_message in metadata dict when adapter returns an error response or when an exception is raised, so callers can distinguish infrastructure failures from empty responses.

_run_single_goal — track attacker and target errors:
- Count attacker_errors (prompt generation failures) and target_errors (adapter timeouts, connection failures) across iterations.
- When all iterations failed due to infrastructure errors and no usable target response was obtained, mark the result with is_error=True, error, and error_message fields.

run() — finalize with ERROR_AGENT_RESPONSE:
- Detect is_error on results and pass evaluation_status=EvaluationStatusEnum.ERROR_AGENT_RESPONSE to Tracker.finalize_goal() so the dashboard shows errors instead of plain FAILED_JAILBREAK.

PAIREvaluation — error-aware ASR:
- Skip is_error rows when computing evaluated_count and ASR.
- Stamp error rows with descriptive evaluation_notes.
- Log error count alongside success rate.

_create_attacker_router — forward config to adapter:
- Add role='attacker' to metadata for clearer logging/error attribution.
- Forward operational keys (timeout, max_tokens, temperature, top_p, top_k, num_ctx) from attacker config to adapter metadata so the adapter respects per-model settings at init time.

_attack_single_goal — track attacker and target errors:
- Count attacker_errors (empty attacker responses) and target_errors (adapter timeouts, connection failures) across techniques.
- When all techniques failed due to infrastructure errors and no usable target response was obtained, mark the result with is_error=True, error, and error_message fields.

PAPEvaluation — error-aware ASR:
- Detect error rows via is_error flag or error+no-response pattern.
- Stamp is_error=True and descriptive evaluation_notes on error rows.
- Exclude error rows from ASR denominator.
- Log error count alongside success rate.

…igated

Error rows (e.g. timeouts) were silently lost through the evaluation pipeline and finalized as FAILED_JAILBREAK ("Mitigated") instead of ERROR_AGENT_RESPONSE.

Root causes fixed:
- completions.py: propagate the normalized 'error' key so _detect_error_indices can identify error rows downstream
- evaluation.py: detect/mark error rows before judge evaluation; preserve error rows through NLL filtering, aggregation, and selection so they reach finalize_all_goals with is_error=True
- sync.py: skip is_error rows in sync_evaluation_to_server so the coordinator's ERROR_AGENT_RESPONSE is not overwritten by FAILED_JAILBREAK

_query_target — return (content, error_message) tuple:
- Wrap router call in try/except so adapter exceptions surface as error messages instead of being lost.
- Pull error_message/error from dict responses so callers can distinguish empty responses from infrastructure failures.

_query_attacker — return (parsed_dict, error_message) tuple:
- Same treatment as the target path so attacker LLM timeouts and connection failures are visible to run_single_goal.

_expand_one_branch — propagate attacker errors:
- Always return a dict (no more None) carrying the last attacker error so run_single_goal can count attacker_errors per branch.

run_single_goal — track attacker and target errors:
- Count attacker_errors (failed branch expansions) and target_errors (adapter timeouts, connection failures) across depth iterations.
- When best_response is empty AND any infra error occurred, mark the result with is_error=True, error, and error_message fields so the TrackingCoordinator finalizes the goal as ERROR_AGENT_RESPONSE instead of FAILED_JAILBREAK ("Mitigated").

TapEvaluation.execute — error-aware finalization:
- Detect is_error rows up front, stamp evaluation_notes with the underlying error, skip judge re-scoring, and emit a dedicated tracker evaluation trace so the dashboard reflects the failure.

Errors from the target adapter (timeouts, connection failures) caused empty target responses, which the Scorer LLM could then hallucinate a high score for — making infrastructure failures look like successful jailbreaks in the dashboard.

core.query_target — return (response, error_message) tuple:
- Wrap router call in try/except so adapter exceptions surface as error messages instead of being lost.
- Pull error_message/error from dict responses so callers can distinguish empty responses from infrastructure failures.

warm_up / lifelong — skip scoring on infrastructure errors:
- Detect (target_resp, target_error) tuple from query_target and short-circuit score_response with score=0 and an error assessment whenever an empty response was caused by an adapter error.
- Track target_error_count and last_target_error across iterations in lifelong; when no jailbreak was scored AND any infra error occurred, mark the per-goal result with is_error=True, error, and error_message fields.

AutoDANTurboEvaluation — error-aware finalization:
- Detect is_error rows up front, force score=0, set scorer_verdict to 'error', stamp evaluation_notes/evaluation_summary with the underlying error, and skip the harmful/safe verdict so the TrackingCoordinator finalizes the goal as ERROR_AGENT_RESPONSE instead of FAILED_JAILBREAK ("Mitigated") or — worse — SUCCESSFUL_JAILBREAK from a hallucinated scorer.

tests:
- Update test_query_target to expect the new tuple return.
- Update test_execute_with_only_errors_returns_failure to assert the new is_error/scorer_verdict='error' contract.

+            )
+            if is_error_row:
+                err_msg = item.get("error") or item.get("error_message") or "unknown"
+                auto_score = 0.0


+            for c in mock_backend.update_result.call_args_list
+            if "evaluation_status" in (c.kwargs or {})
+        ]
+        self.assertTrue(len(finalize_calls) >= 1)




Marco Russo (marcorusso97)

ok

Raffaele Paolino (RPaolino) added 8 commits April 24, 2026 12:16

github-code-quality Bot found potential problems Apr 29, 2026


Marco Russo (marcorusso97) approved these changes May 8, 2026


Marco Russo (marcorusso97) merged commit 7abbb0d into main May 8, 2026
20 of 21 checks passed

Marco Russo (marcorusso97) deleted the fix/propagate-errors-to-results branch May 8, 2026 15:05
