@safety-research

Safety Research

Popular repositories

  1. bloom Public

    bloom - evaluate any behavior immediately  🌸🌱

    Python 1.3k 156

  2. petri Public

    An alignment auditing agent capable of quickly exploring alignment hypotheses

    Python 970 145

  3. persona_vectors Public

    Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    Python 386 96

  4. SCONE-bench Public

    175 29

  5. assistant-axis Public

    The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations—sometimes toward bizarr…

    Jupyter Notebook 120 35

  6. safety-tooling Public

    Inference API for many LLMs and other useful tools for empirical research

    Python 114 36

Repositories

Showing 10 of 42 repositories
  • crosscoder_emergent_misalignment Public

    Applying crosscoder model diffing to emergently misaligned models

    Python 5 0 55 8 Updated Apr 1, 2026
  • auditing-agents
    Python 11 2 1 1 Updated Apr 1, 2026
  • agent-transcript-editor Public

    Web UI for viewing, editing, and AI-assisted red teaming of AI agent transcripts

    Python 1 0 0 0 Updated Mar 31, 2026
  • trusted-monitor Public

    Evaluate AI agent transcripts for suspicious behavior (0-100 scoring)

    Python 1 0 0 0 Updated Mar 28, 2026
  • safety-tooling Public

    Inference API for many LLMs and other useful tools for empirical research

    Python 114 MIT 36 13 18 Updated Mar 23, 2026
  • introspection-mechanisms Public

    introspection mechanisms

    Python 3 0 0 0 Updated Mar 21, 2026
  • petri Public

    An alignment auditing agent capable of quickly exploring alignment hypotheses

    Python 970 MIT 145 4 5 Updated Mar 12, 2026
  • PurpleLlama Public Forked from meta-llama/PurpleLlama

    Set of tools to assess and improve LLM security.

    Python 0 823 0 0 Updated Feb 23, 2026
  • bloom Public

    bloom - evaluate any behavior immediately  🌸🌱

    Python 1,270 MIT 156 0 6 Updated Feb 17, 2026
  • casr Public Forked from ispras/casr

    Collect crash (or UndefinedBehaviorSanitizer error) reports, triage, and estimate severity.

    Rust 0 Apache-2.0 36 0 0 Updated Feb 3, 2026
