SAF Benchmark

An open benchmark suite for measuring SOC autonomy levels against the SOC Autonomy Framework (SAF).

Status: v0.1 — Initial scenario definitions. Empirical validation in progress.
Framework: soc-autonomy-framework
Paper: The Autonomous SOC Manifesto

Why This Exists

The SAF paper (Section 7.5) identifies the absence of a comprehensive, vendor-neutral end-to-end SOC benchmark as an open research problem:

"The field lacks a comprehensive, vendor-neutral benchmark spanning the full incident lifecycle. Existing benchmarks address narrow tasks; a community-owned, end-to-end SOC benchmark is urgently needed."

This repo is the beginning of that benchmark.

What It Measures

The SAF Benchmark evaluates a SOC system across four dimensions of the framework:

Dimension	How Measured
Decision Scope	Which decision categories the system handles in each scenario
Autonomous Action Rate	Percentage of scenarios resolved end-to-end without human intervention
Governance	Whether required oversight mechanisms are present and enforced
Human Role	Nature and frequency of human touchpoints
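
As an illustration of the Autonomous Action Rate row, here is a minimal sketch of how that percentage could be computed from per-scenario outcomes. The function name and the example outcomes are hypothetical; they are not part of the saf-benchmark package and are not measured results.

```python
def autonomous_action_rate(resolved_flags):
    """Percentage of scenarios resolved end-to-end without human intervention."""
    flags = list(resolved_flags)
    if not flags:
        raise ValueError("at least one scenario result is required")
    resolved = sum(1 for flag in flags if flag)
    return 100.0 * resolved / len(flags)


# Hypothetical outcomes (illustrative only): True means the system
# resolved the scenario without any human touchpoint.
results = {
    "PHI-001": True,
    "RAN-001": False,
    "INS-001": False,
    "APT-001": True,
    "CRE-001": True,
}

rate = autonomous_action_rate(results.values())
print(f"Autonomous Action Rate: {rate:.1f}%")
```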

Scenario Library

Each scenario is a structured incident with defined inputs, expected decision points, and evaluation criteria.

Scenario	Incident Type	Expected L3 Path	Expected L4 Path
PHI-001	Basic phishing email	Investigate + recommend block	Auto-quarantine + notify
RAN-001	Ransomware lateral movement	Investigate + recommend isolation	Auto-isolate endpoint + escalate
INS-001	Insider data exfiltration	Investigate + recommend review	Auto-restrict + alert CISO
APT-001	APT C2 beacon	Investigate + recommend block	Auto-block + threat hunt trigger
CRE-001	Credential stuffing attack	Investigate + recommend rate-limit	Auto-rate-limit + MFA enforce

How to Use

Running an Evaluation

pip install saf-benchmark
saf-benchmark evaluate --system <your-system> --scenarios all --output results.json

Or manually: for each scenario in scenarios/, run the incident through your system and record:

Whether the system resolved it end-to-end (Y/N)
What human touchpoints were required
What actions were taken and in what order
Confidence scores (if available)

Then score against the evaluation rubric.
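
For manual runs, the four items listed above could be captured in a small record per scenario. The sketch below is one possible shape, not the format the saf-benchmark CLI writes: the ScenarioRecord class and its field names are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ScenarioRecord:
    """Hypothetical record of one manual scenario run (field names are assumed)."""
    scenario_id: str
    resolved_end_to_end: bool                                    # Y/N
    human_touchpoints: List[str] = field(default_factory=list)   # what humans had to do
    actions: List[str] = field(default_factory=list)             # in the order taken
    confidence_scores: Optional[List[float]] = None              # only if the system reports them


# Illustrative values only, loosely following the PHI-001 L3 path.
record = ScenarioRecord(
    scenario_id="PHI-001",
    resolved_end_to_end=False,
    human_touchpoints=["analyst approved recommended block"],
    actions=["investigate", "recommend block"],
)

print(json.dumps(asdict(record), indent=2))
```

Keeping one record per scenario makes it straightforward to score each run against the rubric afterwards.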

Scoring

Each scenario is scored on the four SAF dimensions, from 0 to 5 per dimension. Each dimension's scores are then averaged across all scenarios, and the system's SAF level is the lowest of the four resulting averages.

See metrics/scoring-guide.md for full scoring methodology.
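
The aggregation can be sketched as follows, reading the rule as: average each dimension across scenarios, then take the lowest average. The dimension keys, the function name, and the example scores are assumptions for illustration. Whether and how the result is rounded to a whole level is not specified here, so the sketch returns the raw minimum; metrics/scoring-guide.md remains the authority.

```python
from statistics import mean

# Assumed key names for the four SAF dimensions.
DIMENSIONS = ("decision_scope", "autonomous_action_rate", "governance", "human_role")


def saf_level(scenario_scores):
    """Return (lowest dimension average, all dimension averages).

    scenario_scores maps a scenario id to a dict of 0-5 scores per dimension.
    """
    if not scenario_scores:
        raise ValueError("at least one scored scenario is required")
    for scenario_id, scores in scenario_scores.items():
        for dim in DIMENSIONS:
            if not 0 <= scores[dim] <= 5:
                raise ValueError(f"{scenario_id}: {dim} score must be within 0-5")
    averages = {
        dim: mean(scores[dim] for scores in scenario_scores.values())
        for dim in DIMENSIONS
    }
    return min(averages.values()), averages


# Hypothetical scores, not measured results.
example_scores = {
    "PHI-001": {"decision_scope": 4, "autonomous_action_rate": 4, "governance": 3, "human_role": 4},
    "RAN-001": {"decision_scope": 3, "autonomous_action_rate": 2, "governance": 3, "human_role": 4},
}

level, averages = saf_level(example_scores)
print(level, averages)
```

Taking the minimum means a system cannot offset weak governance with a high action rate: the weakest dimension caps the level.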

Scenario Format

Each scenario is a JSON file with a standard structure:

{
  "id": "PHI-001",
  "name": "Basic Phishing Email",
  "category": "phishing",
  "difficulty": "low",
  "inputs": { ... },
  "decision_points": [ ... ],
  "expected_l3_response": { ... },
  "expected_l4_response": { ... },
  "evaluation_criteria": { ... }
}

See scenarios/README.md for the full schema.
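
Before contributing or running a scenario, a quick structural check can catch missing fields. This sketch checks only the top-level keys shown in the example above; the full schema in scenarios/README.md may impose further constraints, and the missing_keys helper is hypothetical.

```python
import json

# Top-level keys taken from the example scenario structure above.
REQUIRED_KEYS = {
    "id",
    "name",
    "category",
    "difficulty",
    "inputs",
    "decision_points",
    "expected_l3_response",
    "expected_l4_response",
    "evaluation_criteria",
}


def missing_keys(scenario):
    """Return the required top-level keys absent from a scenario dict, sorted."""
    return sorted(REQUIRED_KEYS - set(scenario))


# A deliberately incomplete scenario, for illustration.
draft = json.loads("""
{
  "id": "PHI-001",
  "name": "Basic Phishing Email",
  "category": "phishing",
  "difficulty": "low"
}
""")

print(missing_keys(draft))
```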

Contributing Scenarios

We need more scenarios across more incident types. See CONTRIBUTING.md.

Priority incident types needed:

Cloud infrastructure compromise
Supply chain attack
Zero-day exploitation
BEC (Business Email Compromise)
DDoS with application layer component
Container escape / Kubernetes lateral movement

Related Projects

Repo	Description
soc-autonomy-framework	The SAF specification
saf-classifier	CLI tool to classify a SOC product's autonomy level

Citation

@misc{sirplabs2026safbenchmark,
  title  = {SAF Benchmark: An Open Benchmark Suite for SOC Autonomy Measurement},
  author = {{SIRP Labs}},
  year   = {2026},
  url    = {https://github.com/sirp-labs/saf-benchmark}
}

License

Apache 2.0 — see LICENSE
