SIRP-Labs/saf-benchmark

SAF Benchmark

An open benchmark suite for measuring SOC autonomy levels against the SOC Autonomy Framework (SAF).

Status: v0.1 — Initial scenario definitions. Empirical validation in progress.
Framework: soc-autonomy-framework
Paper: The Autonomous SOC Manifesto


Why This Exists

The SAF paper (Section 7.5) identifies the absence of a comprehensive, vendor-neutral end-to-end SOC benchmark as an open research problem:

"The field lacks a comprehensive, vendor-neutral benchmark spanning the full incident lifecycle. Existing benchmarks address narrow tasks; a community-owned, end-to-end SOC benchmark is urgently needed."

This repo is the beginning of that benchmark.


What It Measures

The SAF Benchmark evaluates a SOC system across four dimensions of the framework:

| Dimension | How Measured |
| --- | --- |
| Decision Scope | Which decision categories the system handles in each scenario |
| Autonomous Action Rate | % of scenarios resolved end-to-end without human intervention |
| Governance | Whether required oversight mechanisms are present and enforced |
| Human Role | Nature and frequency of human touchpoints |

Scenario Library

Each scenario is a structured incident with defined inputs, expected decision points, and evaluation criteria.

| Scenario | Incident Type | Expected L3 Path | Expected L4 Path |
| --- | --- | --- | --- |
| PHI-001 | Basic phishing email | Investigate + recommend block | Auto-quarantine + notify |
| RAN-001 | Ransomware lateral movement | Investigate + recommend isolation | Auto-isolate endpoint + escalate |
| INS-001 | Insider data exfiltration | Investigate + recommend review | Auto-restrict + alert CISO |
| APT-001 | APT C2 beacon | Investigate + recommend block | Auto-block + threat hunt trigger |
| CRE-001 | Credential stuffing attack | Investigate + recommend rate-limit | Auto-rate-limit + MFA enforce |

How to Use

Running an Evaluation

pip install saf-benchmark
saf-benchmark evaluate --system <your-system> --scenarios all --output results.json

Or manually: for each scenario in scenarios/, run the incident through your system and record:

  • Whether the system resolved it end-to-end (Y/N)
  • What human touchpoints were required
  • What actions were taken and in what order
  • Confidence scores (if available)

Then score against the evaluation rubric.
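For manual runs, it can help to capture the four recorded items in one structure per scenario. The sketch below is one possible shape, not the official results format: `ScenarioRecord`, its field names, and `autonomous_action_rate` are hypothetical, and the two example records contain made-up values purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScenarioRecord:
    """One manually recorded run (hypothetical structure, not the official schema)."""
    scenario_id: str
    resolved_end_to_end: bool  # True only if no human intervention was needed
    human_touchpoints: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)  # in the order they were taken
    confidence: Optional[float] = None  # None when the system reports no score


def autonomous_action_rate(records: List[ScenarioRecord]) -> float:
    """Percentage of scenarios resolved end-to-end without human intervention."""
    if not records:
        raise ValueError("no records to evaluate")
    resolved = sum(1 for r in records if r.resolved_end_to_end)
    return 100.0 * resolved / len(records)


# Illustrative values only; these are not real benchmark results.
records = [
    ScenarioRecord("PHI-001", True, actions=["quarantine email", "notify user"]),
    ScenarioRecord(
        "RAN-001",
        False,
        human_touchpoints=["analyst approved isolation"],
        actions=["investigate", "recommend isolation"],
    ),
]
rate = autonomous_action_rate(records)
```

Keeping the touchpoints and actions as ordered lists preserves the sequence information the rubric asks for, while the boolean feeds directly into the Autonomous Action Rate dimension.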

Scoring

Each scenario is scored on the four SAF dimensions, from 0 to 5 per dimension. Each dimension's scores are then averaged across all scenarios, and the system's SAF level is the lowest of the four dimension averages.

See metrics/scoring-guide.md for full scoring methodology.
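As a rough illustration of the aggregation, the sketch below assumes one reading of the rule: average each dimension across all scenarios, then take the lowest of those averages. The dimension keys, function names, and example scores are hypothetical; metrics/scoring-guide.md remains the authority on the actual method, including how a fractional result maps to a level.

```python
from typing import Dict

# Hypothetical dimension keys; the official names are defined by the scoring guide.
DIMENSIONS = ("decision_scope", "autonomous_action_rate", "governance", "human_role")


def dimension_averages(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension's 0-5 score across all scenarios."""
    if not scores:
        raise ValueError("no scenario scores provided")
    averages = {}
    for dim in DIMENSIONS:
        values = []
        for scenario_id, per_dim in scores.items():
            value = per_dim[dim]  # raises KeyError if a dimension is missing
            if not 0 <= value <= 5:
                raise ValueError(f"{scenario_id}/{dim}: score {value} is outside 0-5")
            values.append(value)
        averages[dim] = sum(values) / len(values)
    return averages


def lowest_dimension_average(scores: Dict[str, Dict[str, float]]) -> float:
    """The weakest dimension bounds the overall result (assumed reading of the rule)."""
    return min(dimension_averages(scores).values())


# Made-up scores for two scenarios, for illustration only.
example_scores = {
    "PHI-001": {"decision_scope": 4, "autonomous_action_rate": 4, "governance": 3, "human_role": 4},
    "RAN-001": {"decision_scope": 3, "autonomous_action_rate": 2, "governance": 3, "human_role": 3},
}
averages = dimension_averages(example_scores)
overall = lowest_dimension_average(example_scores)
```

With these example values the dimension averages are 3.5, 3.0, 3.0, and 3.5, so the lowest average is 3.0. Taking the minimum rather than the mean means a system cannot offset weak governance with a high autonomous action rate.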


Scenario Format

Each scenario is a JSON file with a standard structure:

{
  "id": "PHI-001",
  "name": "Basic Phishing Email",
  "category": "phishing",
  "difficulty": "low",
  "inputs": { ... },
  "decision_points": [ ... ],
  "expected_l3_response": { ... },
  "expected_l4_response": { ... },
  "evaluation_criteria": { ... }
}

See scenarios/README.md for the full schema.
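Before running a scenario, a quick structural check can catch files that are missing a top-level field. The sketch below checks only the nine keys shown in the structure above; `missing_scenario_keys` is a hypothetical helper, and the empty objects and lists in the sample are placeholders for the elided contents, which scenarios/README.md defines.

```python
import json
from typing import Any, Dict, List

# Top-level keys taken from the scenario structure shown in this README.
REQUIRED_KEYS = (
    "id",
    "name",
    "category",
    "difficulty",
    "inputs",
    "decision_points",
    "expected_l3_response",
    "expected_l4_response",
    "evaluation_criteria",
)


def missing_scenario_keys(scenario: Dict[str, Any]) -> List[str]:
    """Return the required top-level keys absent from a scenario, in schema order."""
    return [key for key in REQUIRED_KEYS if key not in scenario]


# Placeholder content: the nested values are empty stand-ins, not real scenario data.
sample = """
{
  "id": "PHI-001",
  "name": "Basic Phishing Email",
  "category": "phishing",
  "difficulty": "low",
  "inputs": {},
  "decision_points": [],
  "expected_l3_response": {},
  "expected_l4_response": {},
  "evaluation_criteria": {}
}
"""
scenario = json.loads(sample)
problems = missing_scenario_keys(scenario)
```

An empty list means every top-level field is present. This does not validate the nested contents, which would need the full schema.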


Contributing Scenarios

We need more scenarios across more incident types. See CONTRIBUTING.md.

Priority incident types needed:

  • Cloud infrastructure compromise
  • Supply chain attack
  • Zero-day exploitation
  • BEC (Business Email Compromise)
  • DDoS with an application-layer component
  • Container escape / Kubernetes lateral movement

Related Projects

| Repo | Description |
| --- | --- |
| soc-autonomy-framework | The SAF specification |
| saf-classifier | CLI tool to classify a SOC product's autonomy level |

Citation

@misc{sirplabs2026safbenchmark,
  title  = {SAF Benchmark: An Open Benchmark Suite for SOC Autonomy Measurement},
  author = {{SIRP Labs}},
  year   = {2026},
  url    = {https://github.com/sirp-labs/saf-benchmark}
}

License

Apache 2.0 — see LICENSE

About

Open benchmark suite for measuring SOC autonomy level against the SAF (L0-L5). Community-owned end-to-end incident scenarios for evaluating autonomous SOC systems.
