An open benchmark suite for measuring SOC autonomy levels against the SOC Autonomy Framework (SAF).
Status: v0.1 — Initial scenario definitions. Empirical validation in progress.
Framework: soc-autonomy-framework
Paper: The Autonomous SOC Manifesto
The SAF paper (Section 7.5) identifies the absence of a comprehensive, vendor-neutral end-to-end SOC benchmark as an open research problem:
"The field lacks a comprehensive, vendor-neutral benchmark spanning the full incident lifecycle. Existing benchmarks address narrow tasks; a community-owned, end-to-end SOC benchmark is urgently needed."
This repo is the beginning of that benchmark.
The SAF Benchmark evaluates a SOC system across four dimensions of the framework:
| Dimension | How Measured |
|---|---|
| Decision Scope | Which decision categories the system handles in each scenario |
| Autonomous Action Rate | % of scenarios resolved end-to-end without human intervention |
| Governance | Whether required oversight mechanisms are present and enforced |
| Human Role | Nature and frequency of human touchpoints |
Each scenario is a structured incident with defined inputs, expected decision points, and evaluation criteria.
| Scenario | Incident Type | Expected L3 Path | Expected L4 Path |
|---|---|---|---|
| PHI-001 | Basic phishing email | Investigate + recommend block | Auto-quarantine + notify |
| RAN-001 | Ransomware lateral movement | Investigate + recommend isolation | Auto-isolate endpoint + escalate |
| INS-001 | Insider data exfiltration | Investigate + recommend review | Auto-restrict + alert CISO |
| APT-001 | APT C2 beacon | Investigate + recommend block | Auto-block + threat hunt trigger |
| CRE-001 | Credential stuffing attack | Investigate + recommend rate-limit | Auto-rate-limit + MFA enforce |
pip install saf-benchmark
saf-benchmark evaluate --system <your-system> --scenarios all --output results.json
Or manually: for each scenario in scenarios/, run the incident through your system and record:
- Whether the system resolved it end-to-end (Y/N)
- What human touchpoints were required
- What actions were taken and in what order
- Confidence scores (if available)
Then score against the evaluation rubric.
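The manual record described above can be kept in a small structure so that every scenario is captured the same way. The following is a minimal sketch, not part of the saf-benchmark package: the `ScenarioRecord` class, its field names, and the `autonomous_action_rate` helper are hypothetical, and the sample values are made up for illustration.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json


@dataclass
class ScenarioRecord:
    """One manual run of a scenario. Field names are illustrative assumptions."""
    scenario_id: str
    resolved_end_to_end: bool                    # Y/N from the checklist
    human_touchpoints: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)  # in the order taken
    confidence: Optional[float] = None           # None if the system reports none


def autonomous_action_rate(records: list[ScenarioRecord]) -> float:
    """Percentage of scenarios resolved end-to-end without human intervention."""
    if not records:
        raise ValueError("no scenario records to evaluate")
    resolved = sum(1 for r in records if r.resolved_end_to_end)
    return 100.0 * resolved / len(records)


# Made-up example records, not real benchmark results.
records = [
    ScenarioRecord("PHI-001", True, [], ["quarantine", "notify"], 0.93),
    ScenarioRecord("RAN-001", False, ["analyst approval"], ["investigate"]),
]

print(json.dumps([asdict(r) for r in records], indent=2))
print(f"Autonomous Action Rate: {autonomous_action_rate(records):.1f}%")
```

Serializing the records to JSON keeps the manual path comparable with the `results.json` file written by the CLI, although the CLI's actual output format is not specified here.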
Each scenario is scored on the four SAF dimensions (0–5 per dimension). Each dimension's scores are then averaged across all scenarios, and the system's SAF level is the lowest of those per-dimension averages.
See metrics/scoring-guide.md for full scoring methodology.
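As a rough illustration of the aggregation, the sketch below averages each dimension across scenarios and takes the lowest average. It assumes that reading of the scoring rule; metrics/scoring-guide.md remains the authoritative method. The function name, the dimension keys, and the scores are hypothetical.

```python
# Hypothetical dimension keys mirroring the four SAF dimensions.
DIMENSIONS = ("decision_scope", "autonomous_action_rate", "governance", "human_role")


def saf_level(scores: dict[str, dict[str, float]]) -> tuple[dict[str, float], float]:
    """Return per-dimension averages and the lowest of them.

    `scores` maps scenario id -> {dimension: score in 0..5}.
    """
    if not scores:
        raise ValueError("no scenario scores provided")
    averages = {}
    for dim in DIMENSIONS:
        values = [per_scenario[dim] for per_scenario in scores.values()]
        if any(not 0 <= v <= 5 for v in values):
            raise ValueError(f"{dim}: scores must be in the range 0-5")
        averages[dim] = sum(values) / len(values)
    # The weakest dimension caps the overall level (assumed interpretation).
    return averages, min(averages.values())


# Made-up example scores, not real benchmark results.
example = {
    "PHI-001": {"decision_scope": 4, "autonomous_action_rate": 3,
                "governance": 5, "human_role": 4},
    "RAN-001": {"decision_scope": 2, "autonomous_action_rate": 3,
                "governance": 4, "human_role": 4},
}

averages, level = saf_level(example)
print(averages)
print(level)
```

Taking the minimum rather than the mean means a system cannot offset weak governance with a high action rate, which appears to be the intent of the rule.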
Each scenario is a JSON file with a standard structure:
{
"id": "PHI-001",
"name": "Basic Phishing Email",
"category": "phishing",
"difficulty": "low",
"inputs": { ... },
"decision_points": [ ... ],
"expected_l3_response": { ... },
"expected_l4_response": { ... },
"evaluation_criteria": { ... }
}
See scenarios/README.md for the full schema.
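Before contributing or running a scenario, it can help to check that the file carries the top-level fields listed above. This is a minimal sketch that checks key presence only; the `missing_keys` helper is hypothetical, the sample scenario is invented, and the full schema in scenarios/README.md may impose further constraints not covered here.

```python
import json

# Top-level keys taken from the scenario structure shown above.
REQUIRED_KEYS = (
    "id", "name", "category", "difficulty", "inputs",
    "decision_points", "expected_l3_response",
    "expected_l4_response", "evaluation_criteria",
)


def missing_keys(scenario: dict) -> list[str]:
    """Return the required top-level keys absent from a scenario, in order."""
    return [key for key in REQUIRED_KEYS if key not in scenario]


# Invented sample with placeholder contents; "evaluation_criteria" is omitted
# on purpose to show a failing check.
sample = json.loads("""
{
  "id": "PHI-001",
  "name": "Basic Phishing Email",
  "category": "phishing",
  "difficulty": "low",
  "inputs": {},
  "decision_points": [],
  "expected_l3_response": {},
  "expected_l4_response": {}
}
""")

problems = missing_keys(sample)
if problems:
    print("Missing keys:", ", ".join(problems))
```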
We need more scenarios across more incident types. See CONTRIBUTING.md.
Priority incident types needed:
- Cloud infrastructure compromise
- Supply chain attack
- Zero-day exploitation
- BEC (Business Email Compromise)
- DDoS with application layer component
- Container escape / Kubernetes lateral movement
| Repo | Description |
|---|---|
| soc-autonomy-framework | The SAF specification |
| saf-classifier | CLI tool to classify a SOC product's autonomy level |
@misc{sirplabs2026safbenchmark,
title = {SAF Benchmark: An Open Benchmark Suite for SOC Autonomy Measurement},
author = {{SIRP Labs}},
year = {2026},
url = {https://github.com/sirp-labs/saf-benchmark}
}
Apache 2.0. See LICENSE.