MML Attack #411

Merged
Marco Russo (marcorusso97) merged 21 commits into main from feat/mml-attack on Jun 10, 2026

@RPaolino

Summary

Implements the Multi-Modal Linkage jailbreak attack for Vision-Language Models, based on:

Wang et al., "Jailbreak Large Vision-Language Models Through Multi-Modal Linkage" (2024) — arXiv:2412.00473

The MML attack encodes harmful prompts into images using visual transformations, then pairs them with text prompts that instruct a VLM to decode and act on the hidden content.

Encoding modes:

| Mode | Description |
| --- | --- |
| word_replacement | Replaces key words with random substitutes, renders the text to an image, and provides a dictionary for reconstruction |
| mirror | Renders the text in an image and flips it horizontally |
| rotate | Renders the text in an image and rotates it 180° |
| base64 | Encodes the prompt in Base64 and renders the encoded string in an image |
| mixed | Combines word replacement, mirroring, and rotation |
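
For reviewers who want a feel for the transformations in the table, here is a minimal sketch on toy data. It is not the code in image_encoder.py: the "image" is a plain grid of integers instead of a rendered bitmap, and every function name, the word pool, and the sample sentence are hypothetical.

```python
import base64
import random
from typing import Dict, List, Tuple

Grid = List[List[int]]  # toy stand-in for a rendered image: rows of pixel values


def mirror(grid: Grid) -> Grid:
    """Horizontal flip: reverse the pixels within each row."""
    return [row[::-1] for row in grid]


def rotate_180(grid: Grid) -> Grid:
    """180-degree rotation: reverse the row order and each row."""
    return [row[::-1] for row in grid[::-1]]


def to_base64(text: str) -> str:
    """Base64-encode UTF-8 text and return it as an ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def replace_words(
    text: str, targets: List[str], pool: List[str], rng: random.Random
) -> Tuple[str, Dict[str, str]]:
    """Swap each target word for a random substitute.

    Returns the rewritten text and a substitute -> original dictionary.
    Assumes the targets are distinct and the pool shares no word with the text.
    """
    substitutes = rng.sample(pool, k=len(targets))
    mapping = dict(zip(substitutes, targets))
    forward = {original: sub for sub, original in mapping.items()}
    replaced = " ".join(forward.get(word, word) for word in text.split())
    return replaced, mapping


def restore_words(text: str, mapping: Dict[str, str]) -> str:
    """Undo replace_words using the substitute -> original dictionary."""
    return " ".join(mapping.get(word, word) for word in text.split())


if __name__ == "__main__":
    grid = [[1, 2], [3, 4]]
    print(mirror(grid))
    print(rotate_180(grid))
    print(to_base64("hello"))
    text = "bake the bread slowly"
    replaced, mapping = replace_words(
        text, ["bake", "bread"], ["lamp", "river", "cloud"], random.Random(0)
    )
    print(replaced, mapping, restore_words(replaced, mapping))
```

Mirroring and rotation are each their own inverse, and the word-replacement dictionary is what makes that mode reversible. In this sketch a "mixed"-style encoding would simply chain replace_words, mirror, and rotate_180.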

Prompt styles: game (villain's lair scenario) and control (neutral list-filling).

Changes

  • hackagent/attacks/techniques/mml/ — new attack package:
    • attack.py: MMLAttack orchestrator (BaseAttack subclass) with VLM target validation
    • config.py: DEFAULT_MML_CONFIG + Pydantic MMLConfig/MMLParams models
    • generation.py — prompt construction and image-encoded generation step
    • evaluation.py — response evaluation step
    • image_encoder.py — image rendering + encode_word_replacement, encode_mirror, encode_rotate, encode_base64, encode_mixed
    • prompts.py — all prompt templates for each encoding mode × prompt style
  • hackagent/attacks/registry.py — registers mml attack
  • hackagent/cli/commands/attack.py — CLI support for MML
  • hackagent/cli/tui/attack_specs.py — TUI attack spec for MML
  • hackagent/router/tracking/coordinator.py — injects result_id from tracker into generation results for server sync
  • hackagent/server/dashboard/_page.py — dashboard visualization fix for num_workers > 1
  • tests/unit/attacks/mml/ — comprehensive unit tests (attack, config, generation, image encoder, prompts)
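
To illustrate how a Literal-typed parameter model can reject unknown encoding modes, here is a standard-library sketch. The PR uses Pydantic for MMLParams; this stand-in uses a dataclass instead, and the field names (encoding_mode, prompt_style) and their defaults are assumptions, not the actual schema.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Allowed values come from the PR description; the alias names are hypothetical.
EncodingMode = Literal["word_replacement", "mirror", "rotate", "base64", "mixed"]
PromptStyle = Literal["game", "control"]


@dataclass(frozen=True)
class MMLParamsSketch:
    """Stdlib stand-in for a Pydantic model with Literal-typed fields."""

    encoding_mode: EncodingMode = "word_replacement"  # assumed default
    prompt_style: PromptStyle = "game"  # assumed default

    def __post_init__(self) -> None:
        # Dataclasses do not enforce Literal at runtime, so check explicitly.
        if self.encoding_mode not in get_args(EncodingMode):
            raise ValueError(f"unknown encoding_mode: {self.encoding_mode!r}")
        if self.prompt_style not in get_args(PromptStyle):
            raise ValueError(f"unknown prompt_style: {self.prompt_style!r}")


if __name__ == "__main__":
    print(MMLParamsSketch(encoding_mode="mixed", prompt_style="control"))
```

With Pydantic the same Literal annotations would be validated automatically, so the explicit __post_init__ check is only needed in this dataclass version.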

Fixes #350

- Add 'mixed' encoding mode (word_replacement + mirror + rotation)
- Add encode_mixed() to image_encoder with combined transformations
- Add MIXED_GAME_PROMPT and MIXED_CONTROL_PROMPT templates
- Update MMLParams Literal type to include 'mixed'
- Add _warn_if_not_vlm() validation in MMLAttack
- Inject result_id from tracker into generation results for server sync
- Update docs attack index with MML entry
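
The VLM check added in _warn_if_not_vlm() could look roughly like the sketch below. Only the metadata lookup (name, then model_name) is taken from the snippets in this PR; the hint substrings, the logger name, and the boolean return value are assumptions made for illustration.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("mml_sketch")  # hypothetical logger name

# Hypothetical substrings hinting at vision support; not taken from the PR.
VLM_HINTS = ("vision", "-vl", "llava")


def model_name_from_metadata(metadata: Any) -> Optional[str]:
    """Read the model name from a metadata dict, trying 'name' then 'model_name'."""
    if isinstance(metadata, dict):
        return metadata.get("name") or metadata.get("model_name")
    return None


def warn_if_not_vlm(metadata: Any) -> bool:
    """Log a warning when the target does not look like a VLM.

    Returns True if a warning was logged, False otherwise.
    """
    name = model_name_from_metadata(metadata)
    if name is None:
        return False  # nothing to check, stay silent
    if any(hint in name.lower() for hint in VLM_HINTS):
        return False
    logger.warning("Target %r may not accept image input", name)
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(warn_if_not_vlm({"name": "llava-1.5"}))
    print(warn_if_not_vlm({"model_name": "text-only-7b"}))
```

A name-based heuristic like this can only warn, not block, since a model's name does not reliably indicate whether it accepts images.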

@with_tui_logging(logger_name="hackagent.attacks", level=logging.INFO)
def run(self, goals: List[str]) -> List[Dict]:
...
metadata = self.agent_router.backend_agent.metadata
if isinstance(metadata, dict):
    model_name = metadata.get("name") or metadata.get("model_name")
...
except AttributeError:
...
from .prompts import get_prompt_template

if TYPE_CHECKING:
    from hackagent.router.tracking import Tracker
Marco Russo (marcorusso97) and others added 14 commits June 5, 2026 10:11
- Add GuardrailExtractor for parsing guardrail events from agent responses
- Integrate before/after guardrail detection in router
- Track guardrail events in coordinator and tracker
- Update all attack techniques to handle guardrail-blocked responses:
  baseline, advprefix, bon, cipherchat, flipattack, h4rm3l, pap
- Export guardrail utilities from attacks.shared
- Replace guardrail_blocked/guardrail_event with adapter_type: guardrail
- Add is_guardrail_response() and get_guardrail_info() to response_utils
- Update router to emit structured agent_specific_data (side, categories, reasoning)
- Migrate all 10 attack techniques to use canonical detection helper
- Update tracker to detect guardrail responses via adapter_type
- Switch guardrail.py to JSON-structured output parsing with keyword fallback
- PAIR: pass full guardrail response dict to add_interaction_trace so
  the dashboard can detect and render guardrail blocks per iteration
- TAP: return descriptive guardrail marker string from _query_target
  instead of None so blocked iterations show guardrail info in traces
- Return the structured guardrail response dict instead of string-encoding
  it as [GUARDRAIL:xxx], so tracker and dashboard handle it properly
- Pass empty string to judges for guardrail-blocked responses (score 0)
- Remove [:500] slice on response in trace recording (tracker handles dicts)
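
The "JSON-structured output parsing with keyword fallback" mentioned in the commits above can be sketched as follows. The JSON schema (blocked, categories, reasoning), the keyword list, and the function name are assumptions; the actual guardrail.py may differ.

```python
import json
from typing import Any, Dict

# Hypothetical keywords for the fallback path; not taken from the PR.
BLOCK_KEYWORDS = ("blocked", "unsafe", "violation")


def parse_guardrail_output(raw: str) -> Dict[str, Any]:
    """Parse a guardrail verdict, preferring JSON and falling back to keywords."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None

    # Structured path: a JSON object that carries an explicit "blocked" flag.
    if isinstance(data, dict) and "blocked" in data:
        return {
            "blocked": bool(data["blocked"]),
            "categories": list(data.get("categories") or []),
            "reasoning": str(data.get("reasoning") or ""),
            "source": "json",
        }

    # Fallback path: plain substring match on the raw text.
    lowered = raw.lower()
    hit = any(keyword in lowered for keyword in BLOCK_KEYWORDS)
    return {
        "blocked": hit,
        "categories": [],
        "reasoning": raw if hit else "",
        "source": "keyword",
    }


if __name__ == "__main__":
    print(parse_guardrail_output('{"blocked": true, "categories": ["violence"]}'))
    print(parse_guardrail_output("Request flagged as UNSAFE"))
```

The keyword path is deliberately crude: a phrase such as "not blocked" would still match, which is one reason to prefer the structured path whenever the guardrail model emits valid JSON.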
AutoDAN-Turbo:
- Read phase/subphase from content (not step_name) for DB-loaded traces
- Skip bookend traces (PHASE_START/END, SKIP_FINALIZED)
- Detect WARMUP_SUMMARY via phase+subphase instead of step_name
- Group epochs under iteration sub-headers in the renderer

Guardrail display:
- Add legacy [GUARDRAIL:xxx] string-pattern fallback in extractor
- Add guardrail categories to trace data and rendering templates
- Improve guardrail event rendering with structured pre blocks
- Propagate _guardrail_categories through all parsing paths
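
As a rough illustration of the canonical detection helper described in these commits, the sketch below checks adapter_type on structured responses and falls back to the legacy [GUARDRAIL:xxx] string pattern. The helper names come from the commit messages, but the signatures, the agent_specific_data layout, the meaning of the legacy label, and text_for_judge are assumptions.

```python
import re
from typing import Any, Dict, Optional

# Legacy string marker mentioned in the commits; the captured part is treated as an opaque label.
LEGACY_PATTERN = re.compile(r"\[GUARDRAIL:([^\]]+)\]")


def is_guardrail_response(response: Any) -> bool:
    """True for a structured guardrail dict or a legacy marker string."""
    if isinstance(response, dict):
        return response.get("adapter_type") == "guardrail"
    if isinstance(response, str):
        return LEGACY_PATTERN.search(response) is not None
    return False


def get_guardrail_info(response: Any) -> Optional[Dict[str, Any]]:
    """Extract side, categories, and reasoning when available, else None."""
    if isinstance(response, dict) and response.get("adapter_type") == "guardrail":
        data = response.get("agent_specific_data") or {}
        return {
            "side": data.get("side"),
            "categories": list(data.get("categories") or []),
            "reasoning": data.get("reasoning"),
            "legacy_label": None,
        }
    if isinstance(response, str):
        match = LEGACY_PATTERN.search(response)
        if match:
            return {
                "side": None,
                "categories": [],
                "reasoning": None,
                "legacy_label": match.group(1),
            }
    return None


def text_for_judge(response: Any) -> str:
    """Give judges an empty string for guardrail-blocked responses."""
    return "" if is_guardrail_response(response) else str(response)


if __name__ == "__main__":
    blocked = {
        "adapter_type": "guardrail",
        "agent_specific_data": {"side": "before", "categories": ["violence"]},
    }
    print(get_guardrail_info(blocked))
    print(get_guardrail_info("[GUARDRAIL:before] request refused"))
    print(repr(text_for_judge(blocked)))
```

Keeping detection in one helper means the tracker, the dashboard, and each attack technique agree on what counts as a guardrail block, whether the response is a structured dict or an older string-encoded trace.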
Comment thread hackagent/datasets/intents.py Fixed
Comment thread hackagent/router/tracking/coordinator.py Fixed
@franconicola Nicola Franco (franconicola) temporarily deployed to feat/mml-attack - Docs PR #411 June 9, 2026 08:10 — with Render Destroyed
Comment thread hackagent/server/dashboard/_page.py Fixed
Comment thread hackagent/server/dashboard/_page.py Fixed
@codecov

codecov Bot commented Jun 9, 2026

@marcorusso97 Marco Russo (marcorusso97) merged commit f00d373 into main Jun 10, 2026
24 checks passed
@marcorusso97 Marco Russo (marcorusso97) deleted the feat/mml-attack branch June 10, 2026 07:41

Development

Successfully merging this pull request may close these issues.

Add MML attack

4 participants