
Anti-cheating: detect LLM jailbreak attempts #3

@tuxerrante

Why

Players could attempt to bypass the investigation workflow by using well-known LLM jailbreak techniques to extract the answer directly from the AI agent, undermining the pedagogical purpose of the game.

Proposed behavior

Provide a hidden detection mechanism that identifies jailbreak-style prompts and penalizes the player:

  1. Detection: Classify user messages against known jailbreak patterns (prompt injection, role reassignment, "ignore previous instructions", etc.).
  2. Penalty: Push the player's score below zero and add a "shame" indicator to the final score breakdown.
  3. Logging: Record the attempt for analysis (useful for improving the system prompt's resilience).
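
The detection and logging steps above could be sketched as a simple regex matcher. This is only an illustration of the pattern-matching option: the pattern names, the regexes, the `anticheat` logger name, and the `detect_jailbreak` function are all assumptions, not part of the existing codebase.

```python
import logging
import re

logger = logging.getLogger("anticheat")  # hypothetical logger name

# Hypothetical starter pattern list; names and regexes are illustrative only.
JAILBREAK_PATTERNS = {
    "ignore_instructions": re.compile(
        r"\b(ignore|disregard|forget)\b.{0,40}"
        r"\b(previous|prior|above|earlier)\b.{0,40}"
        r"\b(instructions?|rules|prompts?)\b",
        re.IGNORECASE,
    ),
    "role_reassignment": re.compile(
        r"\b(you are now|from now on,? you are|pretend (to be|you are))\b",
        re.IGNORECASE,
    ),
    "prompt_extraction": re.compile(
        r"\b(reveal|print|show|repeat)\b.{0,40}"
        r"\b(system prompt|hidden instructions)\b",
        re.IGNORECASE,
    ),
}


def detect_jailbreak(message: str) -> list[str]:
    """Return the names of all patterns that match the message (empty if none)."""
    hits = [
        name
        for name, pattern in JAILBREAK_PATTERNS.items()
        if pattern.search(message)
    ]
    if hits:
        # Record the attempt so the system prompt can be hardened later.
        logger.warning("possible jailbreak attempt: patterns=%s message=%r", hits, message)
    return hits
```

Broad phrases such as "act as" are deliberately left out of this sketch, because they are likely to match ordinary investigative questions. A fixed list like this will also miss paraphrased attacks, which is where the LLM-based or lightweight-classifier option would come in.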

References

Implementation considerations

  • Could be done client-side (pattern matching on user input) or server-side (LLM-based classification, or a lightweight classifier).
  • The system prompt already enforces investigation phases — jailbreak detection adds a second layer.
  • The existing scoring system supports negative scores and custom ScoringEvent types.
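
As a rough sketch of how the penalty might plug into the scoring, the example below appends a negative event and marks it in the breakdown. The `ScoringEvent` and `ScoreSheet` classes here are hypothetical stand-ins written for this example; the project's real `ScoringEvent` type, its fields, and the `margin` value may well differ.

```python
from dataclasses import dataclass, field


@dataclass
class ScoringEvent:
    """Hypothetical stand-in; the project's real ScoringEvent type may differ."""
    kind: str
    points: int
    label: str = ""


@dataclass
class ScoreSheet:
    """Hypothetical container that sums events and renders a breakdown."""
    events: list[ScoringEvent] = field(default_factory=list)

    def total(self) -> int:
        return sum(e.points for e in self.events)

    def breakdown(self) -> list[str]:
        # One line per event, with an explicit sign on the points.
        return [f"{e.label or e.kind}: {e.points:+d}" for e in self.events]


def apply_jailbreak_penalty(sheet: ScoreSheet, margin: int = 50) -> ScoringEvent:
    """Append a penalty large enough that the total ends at or below -margin."""
    penalty = -(max(sheet.total(), 0) + margin)
    event = ScoringEvent(
        kind="jailbreak_attempt",
        points=penalty,
        label="[SHAME] Jailbreak attempt",  # distinct indicator for the breakdown
    )
    sheet.events.append(event)
    return event
```

Sizing the penalty relative to the current total, rather than using a fixed deduction, is one way to guarantee a negative final score regardless of how many points the player had already earned.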

Acceptance criteria

  • Known jailbreak patterns are detected and penalized.
  • Penalty is visible in the score breakdown with a distinct indicator.
  • Detection does not produce false positives on legitimate investigative questions.
  • At least a basic pattern list is maintained and documented.

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), gameplay (Game mechanics and player experience), priority: P3 (Low - nice to have)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
