MML Attack by RPaolino · Pull Request #411 · AISecurityLab/hackagent

Raffaele Paolino (RPaolino) · 2026-06-01T07:38:23Z

Summary

Implements the Multi-Modal Linkage jailbreak attack for Vision-Language Models, based on:

Wang et al., "Jailbreak Large Vision-Language Models Through Multi-Modal Linkage" (2024) — arXiv:2412.00473

The MML attack encodes harmful prompts into images using visual transformations, then pairs them with text prompts that instruct a VLM to decode and act on the hidden content.

Encoding modes:

| Mode | Description |
| --- | --- |
| `word_replacement` | Replaces key words with random substitutes, renders the result to an image, and provides a dictionary for reconstruction |
| `mirror` | Renders text in an image and flips it horizontally |
| `rotate` | Renders text in an image and rotates it 180° |
| `base64` | Encodes the prompt in Base64 and renders the encoded string in an image |
| `mixed` | Combines word replacement, mirroring, and rotation |
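The transformations in the table can be illustrated without any imaging library. The following is a minimal sketch, not the code in image_encoder.py: a small grid of 0/1 pixels stands in for the rendered image, and every function name here is hypothetical.

```python
import base64
import random

# Toy stand-in for a rendered image: rows of 0/1 pixels (an assumption;
# the PR renders real images).
Bitmap = list[list[int]]


def mirror(bitmap: Bitmap) -> Bitmap:
    """Flip horizontally by reversing the pixels in every row."""
    return [row[::-1] for row in bitmap]


def rotate_180(bitmap: Bitmap) -> Bitmap:
    """Rotate by 180 degrees: reverse the row order and each row."""
    return [row[::-1] for row in bitmap[::-1]]


def to_base64(prompt: str) -> str:
    """Produce the string a Base64 mode would render instead of the prompt."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def replace_words(prompt, targets, substitutes, rng):
    """Swap each target word for a distinct random substitute.

    Returns the rewritten text plus a substitute -> original dictionary
    that a reader needs in order to reconstruct the prompt.
    """
    pool = rng.sample(substitutes, k=len(targets))
    mapping = dict(zip(targets, pool))  # original -> substitute
    rewritten = " ".join(mapping.get(word, word) for word in prompt.split())
    dictionary = {sub: orig for orig, sub in mapping.items()}
    return rewritten, dictionary


def restore_words(text, dictionary):
    """Undo replace_words using the reconstruction dictionary."""
    return " ".join(dictionary.get(word, word) for word in text.split())


glyph = [[1, 0, 0],
         [1, 1, 0]]
print(mirror(glyph))
print(rotate_180(glyph))
print(to_base64("hello"))

text, lookup = replace_words(
    "bake a chocolate cake",
    targets=["chocolate", "cake"],
    substitutes=["lamp", "river", "cloud"],
    rng=random.Random(0),
)
print(text, lookup, restore_words(text, lookup))
```

The sketch assumes that substitute words do not already occur in the prompt; otherwise the reconstruction dictionary would be ambiguous. How `mixed` orders its three steps is not stated in this PR description, so no composition is shown.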

Prompt styles: game (villain's lair scenario) and control (neutral list-filling).

Changes

- hackagent/attacks/techniques/mml/: new attack package
  - attack.py: MMLAttack orchestrator (BaseAttack subclass) with VLM target validation
  - config.py: DEFAULT_MML_CONFIG + Pydantic MMLConfig/MMLParams models
  - generation.py: prompt construction and image-encoded generation step
  - evaluation.py: response evaluation step
  - image_encoder.py: image rendering + encode_word_replacement, encode_mirror, encode_rotate, encode_base64, encode_mixed
  - prompts.py: all prompt templates for each encoding mode × prompt style
- hackagent/attacks/registry.py: registers the mml attack
- hackagent/cli/commands/attack.py: CLI support for MML
- hackagent/cli/tui/attack_specs.py: TUI attack spec for MML
- hackagent/router/tracking/coordinator.py: injects result_id from tracker into generation results for server sync
- hackagent/server/dashboard/_page.py: dashboard visualization fix for num_workers > 1
- tests/unit/attacks/mml/: comprehensive unit tests (attack, config, generation, image encoder, prompts)
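To make the config entry concrete, here is a rough sketch of how the encoding mode and prompt style could be constrained with `Literal` types. It uses a standard-library dataclass as a stand-in for the Pydantic models, and the class name, field names, and defaults are all assumptions rather than the contents of config.py.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# The allowed values come from the PR description; everything else is assumed.
EncodingMode = Literal["word_replacement", "mirror", "rotate", "base64", "mixed"]
PromptStyle = Literal["game", "control"]


@dataclass(frozen=True)
class MMLParamsSketch:
    """Hypothetical stand-in for MMLParams (the real model uses Pydantic)."""

    encoding_mode: EncodingMode = "mixed"  # default chosen for illustration only
    prompt_style: PromptStyle = "game"     # default chosen for illustration only

    def __post_init__(self) -> None:
        # Dataclasses do not enforce Literal types at runtime, so check by hand.
        if self.encoding_mode not in get_args(EncodingMode):
            raise ValueError(f"unknown encoding_mode: {self.encoding_mode!r}")
        if self.prompt_style not in get_args(PromptStyle):
            raise ValueError(f"unknown prompt_style: {self.prompt_style!r}")


params = MMLParamsSketch(encoding_mode="mirror", prompt_style="control")
print(params)
```

With Pydantic, the same `Literal` annotations would be validated automatically, which is presumably why the PR updates the Literal type when adding a new mode.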

Fixes #350

- Add 'mixed' encoding mode (word_replacement + mirror + rotation)
- Add encode_mixed() to image_encoder with combined transformations
- Add MIXED_GAME_PROMPT and MIXED_CONTROL_PROMPT templates
- Update MMLParams Literal type to include 'mixed'
- Add _warn_if_not_vlm() validation in MMLAttack
- Inject result_id from tracker into generation results for server sync
- Update docs attack index with MML entry
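The VLM validation step could look roughly like the sketch below. Only the metadata lookup (`name`, then `model_name`) and the logger name are taken from the diff fragments on this page; the hint list, the function names, and the return value are hypothetical, and the heuristic actually used by `_warn_if_not_vlm()` is not shown in this PR description.

```python
import logging
from typing import Any, Optional

# Logger name as it appears in the decorator in the diff fragment.
logger = logging.getLogger("hackagent.attacks")

# Hypothetical substrings that suggest a vision-capable model.
VLM_HINTS = ("vision", "-vl", "llava")


def extract_model_name(metadata: Any) -> Optional[str]:
    """Read the model name from agent metadata, trying 'name' then 'model_name'."""
    if isinstance(metadata, dict):
        return metadata.get("name") or metadata.get("model_name")
    return None


def warn_if_not_vlm(metadata: Any) -> bool:
    """Log a warning and return True when the target does not look like a VLM."""
    name = extract_model_name(metadata)
    if name is None:
        logger.warning("Could not determine the target model; MML expects a VLM.")
        return True
    if not any(hint in name.lower() for hint in VLM_HINTS):
        logger.warning("Target %r may not accept image input.", name)
        return True
    return False


print(warn_if_not_vlm({"name": "llava-1.5"}))
print(warn_if_not_vlm({"model_name": "text-only-7b"}))
```

A warning rather than an exception keeps the attack usable against models whose names do not reveal their modality.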

+        ]
+
+    @with_tui_logging(logger_name="hackagent.attacks", level=logging.INFO)
+    def run(self, goals: List[str]) -> List[Dict]:


+            metadata = self.agent_router.backend_agent.metadata
+            if isinstance(metadata, dict):
+                model_name = metadata.get("name") or metadata.get("model_name")
+        except AttributeError:


+from .prompts import get_prompt_template
+
+if TYPE_CHECKING:
+    from hackagent.router.tracking import Tracker


- Add GuardrailExtractor for parsing guardrail events from agent responses
- Integrate before/after guardrail detection in router
- Track guardrail events in coordinator and tracker
- Update all attack techniques to handle guardrail-blocked responses: baseline, advprefix, bon, cipherchat, flipattack, h4rm3l, pap
- Export guardrail utilities from attacks.shared

- Replace guardrail_blocked/guardrail_event with adapter_type: guardrail
- Add is_guardrail_response() and get_guardrail_info() to response_utils
- Update router to emit structured agent_specific_data (side, categories, reasoning)
- Migrate all 10 attack techniques to use canonical detection helper
- Update tracker to detect guardrail responses via adapter_type
- Switch guardrail.py to JSON-structured output parsing with keyword fallback
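As an illustration of the canonical detection helper, the sketch below shows one way such functions might be written. The function names, the `adapter_type` value, and the `agent_specific_data` fields (side, categories, reasoning) come from the commit message; the signatures, the exact dictionary layout, and the sample values are assumptions.

```python
from typing import Any, Optional


def is_guardrail_response(response: Any) -> bool:
    """True when the response dict is tagged as coming from a guardrail."""
    return isinstance(response, dict) and response.get("adapter_type") == "guardrail"


def get_guardrail_info(response: Any) -> Optional[dict]:
    """Return side, categories, and reasoning for a guardrail response, else None."""
    if not is_guardrail_response(response):
        return None
    data = response.get("agent_specific_data") or {}  # assumed nesting
    return {
        "side": data.get("side"),
        "categories": list(data.get("categories") or []),
        "reasoning": data.get("reasoning"),
    }


# Sample values are illustrative only.
blocked = {
    "adapter_type": "guardrail",
    "agent_specific_data": {
        "side": "before",
        "categories": ["example_category"],
        "reasoning": "example reasoning",
    },
}
print(is_guardrail_response(blocked), get_guardrail_info(blocked))
print(is_guardrail_response("plain text answer"))
```

Checking a single `adapter_type` field gives every attack technique one place to detect a block, instead of each one testing separate flags.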

- PAIR: pass full guardrail response dict to add_interaction_trace so the dashboard can detect and render guardrail blocks per iteration
- TAP: return descriptive guardrail marker string from _query_target instead of None so blocked iterations show guardrail info in traces

- Return the structured guardrail response dict instead of string-encoding it as [GUARDRAIL:xxx], so tracker and dashboard handle it properly
- Pass empty string to judges for guardrail-blocked responses (score 0)
- Remove [:500] slice on response in trace recording (tracker handles dicts)

AutoDAN-Turbo:
- Read phase/subphase from content (not step_name) for DB-loaded traces
- Skip bookend traces (PHASE_START/END, SKIP_FINALIZED)
- Detect WARMUP_SUMMARY via phase+subphase instead of step_name
- Group epochs under iteration sub-headers in the renderer

Guardrail display:
- Add legacy [GUARDRAIL:xxx] string-pattern fallback in extractor
- Add guardrail categories to trace data and rendering templates
- Improve guardrail event rendering with structured pre blocks
- Propagate _guardrail_categories through all parsing paths
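The legacy string-pattern fallback can be pictured as a small regular-expression match. This is only a sketch: the function name is hypothetical, and what the `xxx` payload in `[GUARDRAIL:xxx]` encodes is not specified here, so the payload is returned as an uninterpreted string.

```python
import re
from typing import Optional

# Matches the legacy marker form quoted in the commit message.
LEGACY_GUARDRAIL_RE = re.compile(r"\[GUARDRAIL:([^\]]+)\]")


def parse_legacy_marker(text: str) -> Optional[str]:
    """Return the payload of the first [GUARDRAIL:...] marker, or None."""
    match = LEGACY_GUARDRAIL_RE.search(text)
    return match.group(1) if match else None


print(parse_legacy_marker("[GUARDRAIL:example] request blocked"))
print(parse_legacy_marker("a normal model answer"))
```

A fallback of this kind lets traces recorded before the structured format was introduced still render as guardrail events.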

codecov · 2026-06-09T08:22:07Z

Codecov Report

❌ Patch coverage is 70.68966% with 136 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| hackagent/attacks/techniques/mml/evaluation.py | 18.18% | 54 Missing ⚠️ |
| hackagent/server/dashboard/_page.py | 2.38% | 41 Missing ⚠️ |
| hackagent/attacks/techniques/mml/generation.py | 85.04% | 16 Missing ⚠️ |
| hackagent/attacks/techniques/mml/image_encoder.py | 90.74% | 10 Missing ⚠️ |
| hackagent/router/tracking/coordinator.py | 0.00% | 6 Missing ⚠️ |
| hackagent/attacks/techniques/mml/attack.py | 94.11% | 5 Missing ⚠️ |
| hackagent/attacks/techniques/mml/config.py | 87.50% | 3 Missing ⚠️ |
| hackagent/cli/commands/attack.py | 83.33% | 1 Missing ⚠️ |
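The per-file figures can be cross-checked against the headline numbers with a few lines of arithmetic. The missing-line counts and the patch percentage are taken from the report; the total number of patch lines is not stated there and is inferred here, so treat it as a derived estimate.

```python
# Missing lines per file, copied from the Codecov table.
missing_by_file = {
    "mml/evaluation.py": 54,
    "dashboard/_page.py": 41,
    "mml/generation.py": 16,
    "mml/image_encoder.py": 10,
    "tracking/coordinator.py": 6,
    "mml/attack.py": 5,
    "mml/config.py": 3,
    "cli/commands/attack.py": 1,
}
patch_pct = 70.68966  # headline patch coverage from the report

missing = sum(missing_by_file.values())
# Inferred, not reported: coverage = 1 - missing / total, solved for total.
total = round(missing / (1 - patch_pct / 100))
covered = total - missing

print(missing, total, covered, round(100 * covered / total, 5))
```

Under this inference, evaluation.py and the dashboard page together account for 95 of the missing lines, which is where extra tests would move the patch coverage most.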


Raffaele Paolino (RPaolino) added 4 commits May 28, 2026 09:55

feat: added mml attack

d2f249e

fix: visualization working for num_workers>1

d93a2b7

fix: cli, tui now support mml-attack

70f82a3

Nicola Franco (franconicola) deployed to feat/mml-attack - Docs PR #411 with Render on June 1, 2026 07:38

github-code-quality Bot found potential problems Jun 1, 2026


Marco Russo (marcorusso97) and others added 14 commits June 5, 2026 10:11

✨ feat: group intents by categories

d693949

📝 docs: documented grouping of intents by categories

cee1093

🐛 fix: fixed enum type for python 310

f3f9c56

🐛 fix: fixed enum for python 3 10

c98e869

feat: guardrail config in run_config

d1ee130

feat: added documentation, cli and tui support of guardrails

0292b3a

fix: prevent TAP attacker from seeing guardrail internals on block

32982fd

feat: added unit tests on guardrails

42500ea

fix: added PIL dependency

f443a04

Nicola Franco (franconicola) deployed to feat/mml-attack - Docs PR #411 with Render on June 5, 2026 08:12

github-code-quality Bot found potential problems Jun 5, 2026


Comment thread hackagent/datasets/intents.py Fixed

Comment thread hackagent/router/tracking/coordinator.py Fixed

🔀 merge(merge-main): merging main branch

c32f61b

Nicola Franco (franconicola) temporarily deployed to feat/mml-attack - Docs PR #411 June 9, 2026 08:10 — with Render Destroyed

github-code-quality Bot found potential problems Jun 9, 2026


Comment thread hackagent/server/dashboard/_page.py Fixed

Comment thread hackagent/server/dashboard/_page.py Fixed

AI4I (AI4I-IT) added 2 commits June 9, 2026 17:12

🐛 fix(merge): repair botched main merge in MML branch

ceb4cab

🎨 style(format): lint formatting

431c515

Marco Russo (marcorusso97) approved these changes Jun 10, 2026


Marco Russo (marcorusso97) merged commit f00d373 into main Jun 10, 2026
24 checks passed

Marco Russo (marcorusso97) deleted the feat/mml-attack branch June 10, 2026 07:41

Marco Russo (marcorusso97) merged 21 commits into main from feat/mml-attack


4 participants
