Why
Players could attempt to bypass the investigation workflow by using well-known LLM jailbreak techniques to extract the answer directly from the AI agent, undermining the pedagogical purpose of the game.
Proposed behavior
Provide a hidden detection mechanism that identifies jailbreak-style prompts and penalizes the player:
- Detection: Classify user messages against known jailbreak patterns (prompt injection, role reassignment, "ignore previous instructions", etc.).
- Penalty: Drive the player's score negative and add a "shame" indicator to the final score breakdown.
- Logging: Record the attempt for analysis (useful for improving the system prompt's resilience).
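As a rough illustration of the detection step, the sketch below uses plain regex matching against a few well-known phrasings. Everything in it is an assumption made for illustration: the `detect_jailbreak` function, the `JAILBREAK_PATTERNS` table, and the category labels do not come from the existing codebase, and the pattern list is far from exhaustive.

```python
import re

# Hypothetical pattern table: category label -> compiled regex.
# These few phrasings are illustrative only, not a complete list.
JAILBREAK_PATTERNS = {
    "ignore_instructions": re.compile(
        r"\b(ignore|disregard|forget)\b.{0,40}"
        r"\b(previous|prior|above|earlier)\b.{0,40}"
        r"\b(instructions?|rules?|prompts?)\b",
        re.IGNORECASE | re.DOTALL,
    ),
    "role_reassignment": re.compile(
        r"\b(you are now|from now on,? you are|pretend (to be|you are))\b",
        re.IGNORECASE,
    ),
    "prompt_extraction": re.compile(
        r"\b(reveal|show|print|repeat)\b.{0,40}"
        r"\b(system prompt|hidden instructions?)\b",
        re.IGNORECASE | re.DOTALL,
    ),
}


def detect_jailbreak(message: str) -> list[str]:
    """Return the labels of every pattern that matches the message."""
    return [
        label
        for label, pattern in JAILBREAK_PATTERNS.items()
        if pattern.search(message)
    ]
```

An empty list means no pattern matched. Because the penalty is severe, a matcher like this would likely need tuning against false positives, and it will miss paraphrased attempts that an LLM-based classifier might catch.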
References
Implementation considerations
- Could be done client-side (pattern matching on user input) or server-side (LLM-based classification, or a lightweight classifier).
- The system prompt already enforces investigation phases — jailbreak detection adds a second layer.
- The existing scoring system supports negative scores and custom ScoringEvent types.
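To make the penalty and logging steps concrete, here is a minimal sketch of how a jailbreak penalty could be recorded as a scoring event. The `ScoringEvent` and `ScoreSheet` classes below are hypothetical stand-ins written for this example; the real `ScoringEvent` type almost certainly has a different shape. The `penalize_jailbreak` helper, the `"jailbreak_attempt"` kind, the `"shame"` label, and the default of 100 penalty points are likewise assumptions.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("jailbreak_detection")


@dataclass
class ScoringEvent:
    """Hypothetical stand-in; the real ScoringEvent type may differ."""
    kind: str
    points: int
    label: str = ""


@dataclass
class ScoreSheet:
    """Hypothetical score holder used only for this sketch."""
    events: list[ScoringEvent] = field(default_factory=list)

    @property
    def total(self) -> int:
        return sum(event.points for event in self.events)


def penalize_jailbreak(
    sheet: ScoreSheet, message: str, penalty_points: int = 100
) -> ScoringEvent:
    """Append a penalty event that always leaves the total negative."""
    # Cancel any positive score first, then subtract the fixed penalty.
    points = -(max(sheet.total, 0) + penalty_points)
    event = ScoringEvent(kind="jailbreak_attempt", points=points, label="shame")
    sheet.events.append(event)
    # Record a truncated copy of the message for later analysis.
    logger.warning("Jailbreak attempt recorded: %r", message[:200])
    return event
```

With a sheet holding a single 50-point event, calling `penalize_jailbreak` appends a -150 event and leaves the total at -100. Cancelling the positive score before subtracting is one possible design; a flat deduction would be simpler but might leave a high-scoring player above zero.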
Acceptance criteria