MML Attack #411
Merged
- Add 'mixed' encoding mode (word_replacement + mirror + rotation)
- Add encode_mixed() to image_encoder with combined transformations
- Add MIXED_GAME_PROMPT and MIXED_CONTROL_PROMPT templates
- Update MMLParams Literal type to include 'mixed'
- Add _warn_if_not_vlm() validation in MMLAttack
- Inject result_id from tracker into generation results for server sync
- Update docs attack index with MML entry
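The `Literal` update and the VLM check described in this commit could look roughly like the sketch below. It is not the code from this PR: it uses a stdlib dataclass instead of the Pydantic model, and `MMLParamsSketch`, `warn_if_not_vlm`, and the `VLM_NAME_HINTS` list are hypothetical names chosen for illustration.

```python
import warnings
from dataclasses import dataclass
from typing import Literal, Optional, get_args

# Encoding modes and prompt styles named in this PR; 'mixed' is the new mode.
EncodingMode = Literal["word_replacement", "mirror", "rotate", "base64", "mixed"]
PromptStyle = Literal["game", "control"]

# Hypothetical hint list: substrings that suggest a vision-capable model.
VLM_NAME_HINTS = ("vision", "llava", "-vl")


@dataclass
class MMLParamsSketch:
    """Stand-in for MMLParams (assumption: the real model is Pydantic)."""

    encoding_mode: str = "mixed"
    prompt_style: str = "game"

    def __post_init__(self) -> None:
        # Reject any value that is not listed in the Literal types above.
        if self.encoding_mode not in get_args(EncodingMode):
            raise ValueError(f"unknown encoding_mode: {self.encoding_mode!r}")
        if self.prompt_style not in get_args(PromptStyle):
            raise ValueError(f"unknown prompt_style: {self.prompt_style!r}")


def warn_if_not_vlm(model_name: Optional[str]) -> bool:
    """Warn and return True when the model name does not look like a VLM."""
    name = (model_name or "").lower()
    if any(hint in name for hint in VLM_NAME_HINTS):
        return False
    warnings.warn(
        f"Target {model_name!r} may not accept image input; MML expects a VLM.",
        stacklevel=2,
    )
    return True
```

A name-based heuristic only warns rather than raising, so a run against an unrecognized but vision-capable model is not blocked.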
@with_tui_logging(logger_name="hackagent.attacks", level=logging.INFO)
def run(self, goals: List[str]) -> List[Dict]:

metadata = self.agent_router.backend_agent.metadata
if isinstance(metadata, dict):
    model_name = metadata.get("name") or metadata.get("model_name")
except AttributeError:

from .prompts import get_prompt_template
if TYPE_CHECKING:
    from hackagent.router.tracking import Tracker
- Add GuardrailExtractor for parsing guardrail events from agent responses
- Integrate before/after guardrail detection in router
- Track guardrail events in coordinator and tracker
- Update all attack techniques to handle guardrail-blocked responses: baseline, advprefix, bon, cipherchat, flipattack, h4rm3l, pap
- Export guardrail utilities from attacks.shared
- Replace guardrail_blocked/guardrail_event with adapter_type: guardrail
- Add is_guardrail_response() and get_guardrail_info() to response_utils
- Update router to emit structured agent_specific_data (side, categories, reasoning)
- Migrate all 10 attack techniques to use canonical detection helper
- Update tracker to detect guardrail responses via adapter_type
- Switch guardrail.py to JSON-structured output parsing with keyword fallback
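A minimal sketch of how the two detection helpers named above might behave, assuming a guardrail response is a dict carrying `adapter_type: "guardrail"` and an `agent_specific_data` mapping with `side`, `categories`, and `reasoning`. The function names come from the commit message; the signatures, return shape, and sample values are assumptions, not the actual `response_utils` code.

```python
from typing import Any, Dict, Optional


def is_guardrail_response(response: Any) -> bool:
    """True when the response is a dict tagged with adapter_type 'guardrail'."""
    return isinstance(response, dict) and response.get("adapter_type") == "guardrail"


def get_guardrail_info(response: Any) -> Optional[Dict[str, Any]]:
    """Return side/categories/reasoning for a guardrail response, else None."""
    if not is_guardrail_response(response):
        return None
    # Assumed layout: the router stores details under 'agent_specific_data'.
    data = response.get("agent_specific_data") or {}
    return {
        "side": data.get("side"),
        "categories": list(data.get("categories") or []),
        "reasoning": data.get("reasoning"),
    }


# Illustrative blocked response (all field values are made up for the example).
blocked = {
    "adapter_type": "guardrail",
    "agent_specific_data": {
        "side": "before",
        "categories": ["example_category"],
        "reasoning": "example reasoning",
    },
}
info = get_guardrail_info(blocked)
```

Keeping detection in one helper means each attack technique checks a single field instead of matching on response text.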
- PAIR: pass full guardrail response dict to add_interaction_trace so the dashboard can detect and render guardrail blocks per iteration
- TAP: return descriptive guardrail marker string from _query_target instead of None so blocked iterations show guardrail info in traces
- Return the structured guardrail response dict instead of string-encoding it as [GUARDRAIL:xxx], so tracker and dashboard handle it properly
- Pass empty string to judges for guardrail-blocked responses (score 0)
- Remove [:500] slice on response in trace recording (tracker handles dicts)
AutoDAN-Turbo:
- Read phase/subphase from content (not step_name) for DB-loaded traces
- Skip bookend traces (PHASE_START/END, SKIP_FINALIZED)
- Detect WARMUP_SUMMARY via phase+subphase instead of step_name
- Group epochs under iteration sub-headers in the renderer

Guardrail display:
- Add legacy [GUARDRAIL:xxx] string-pattern fallback in extractor
- Add guardrail categories to trace data and rendering templates
- Improve guardrail event rendering with structured pre blocks
- Propagate _guardrail_categories through all parsing paths
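The legacy string-pattern fallback could be sketched as below. What the `xxx` payload in `[GUARDRAIL:xxx]` encodes is not stated here, so this sketch keeps it as an opaque value under a hypothetical `legacy_marker` key; `parse_legacy_guardrail` is also a hypothetical name, not the extractor's real API.

```python
import re
from typing import Any, Dict, Optional

# Matches the legacy marker form "[GUARDRAIL:xxx]" anywhere in a string.
_LEGACY_GUARDRAIL = re.compile(r"\[GUARDRAIL:([^\]]+)\]")


def parse_legacy_guardrail(text: str) -> Optional[Dict[str, Any]]:
    """Convert a legacy marker string into a structured guardrail dict."""
    match = _LEGACY_GUARDRAIL.search(text)
    if match is None:
        return None
    return {
        "adapter_type": "guardrail",
        # Hypothetical key: the payload is kept verbatim, not interpreted.
        "legacy_marker": match.group(1).strip(),
    }
```

With a fallback like this, traces recorded before the structured format was introduced can still be rendered as guardrail blocks.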
Marco Russo (marcorusso97)
approved these changes
Jun 10, 2026
Summary
Implements the Multi-Modal Linkage jailbreak attack for Vision-Language Models, based on:
The MML attack encodes harmful prompts into images using visual transformations, then pairs them with text prompts that instruct a VLM to decode and act on the hidden content.
Encoding modes:
- `word_replacement`
- `mirror`
- `rotate`
- `base64`
- `mixed`

Prompt styles: `game` (villain's lair scenario) and `control` (neutral list-filling).

Changes
- `hackagent/attacks/techniques/mml/`: new attack package
  - `attack.py`: `MMLAttack` orchestrator (`BaseAttack` subclass) with VLM target validation
  - `config.py`: `DEFAULT_MML_CONFIG` + Pydantic `MMLConfig` / `MMLParams` models
  - `generation.py`: prompt construction and image-encoded generation step
  - `evaluation.py`: response evaluation step
  - `image_encoder.py`: image rendering + `encode_word_replacement`, `encode_mirror`, `encode_rotate`, `encode_base64`, `encode_mixed`
  - `prompts.py`: all prompt templates for each encoding mode × prompt style
- `hackagent/attacks/registry.py`: registers `mml` attack
- `hackagent/cli/commands/attack.py`: CLI support for MML
- `hackagent/cli/tui/attack_specs.py`: TUI attack spec for MML
- `hackagent/router/tracking/coordinator.py`: injects `result_id` from tracker into generation results for server sync
- `hackagent/server/dashboard/_page.py`: dashboard visualization fix for `num_workers > 1`
- `tests/unit/attacks/mml/`: comprehensive unit tests (attack, config, generation, image encoder, prompts)

Fixes #350