Anti-cheating: detect LLM jailbreak attempts

## Why

Players could attempt to bypass the investigation workflow by using well-known LLM jailbreak techniques to extract the answer directly from the AI agent, undermining the pedagogical purpose of the game.

## Proposed behavior

Provide a hidden detection mechanism that identifies jailbreak-style prompts and penalizes the player:

1. **Detection**: Classify user messages against known jailbreak patterns (prompt injection, role reassignment, "ignore previous instructions", etc.).
2. **Penalty**: Move the player's score to a negative number and add a "shame" indicator in the final score breakdown.
3. **Logging**: Record the attempt for analysis (useful for improving the system prompt's resilience).

## References

- [Jailbreak taxonomy (arXiv 2503.08990)](https://arxiv.org/html/2503.08990v1)
- [Nature: LLM adversarial attacks](https://www.nature.com/articles/s41467-026-69010-1)

## Implementation considerations

- Could be done client-side (pattern matching on user input) or server-side (LLM-based classification, or a lightweight classifier).
- The system prompt already enforces investigation phases — jailbreak detection adds a second layer.
- The existing scoring system supports negative scores and custom `ScoringEvent` types.

## Acceptance criteria

- [ ] Known jailbreak patterns are detected and penalized.
- [ ] Penalty is visible in the score breakdown with a distinct indicator.
- [ ] Detection does not false-positive on legitimate investigative questions.
- [ ] At least a basic pattern list is maintained and documented.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Anti-cheating: detect LLM jailbreak attempts #3

Why

Proposed behavior

References

Implementation considerations

Acceptance criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Anti-cheating: detect LLM jailbreak attempts #3

Description

Why

Proposed behavior

References

Implementation considerations

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions