A standardized attack/defense evaluation arena for federated learning security research
FedArena is a research platform where you submit FL attack or defense algorithms — via natural language prompt or code — and the system automatically evaluates them against a standardized benchmark matrix and ranks them on a leaderboard.
Built on FastAPI + React + PyTorch, with an OpenAI-compatible LLM integration for prompt-based code generation and experiment planning.
- 2026-04-28 — v0.2.0 released. Phase 1 complete: task queue with concurrency control, draft persistence for prompt mode, training curve visualization, Markdown/PDF report export, failure diagnostics in UI, 72 backend tests.
- 2026-04-20 — v0.1.0 released. Core arena loop complete: LLM-powered prompt mode with code review, benchmark matrix evaluation, leaderboard ranking, CI with 40+ backend tests.
Arena — Submit a new attack or defense (describe it in natural language or paste code). The system generates the implementation, validates it, evaluates it against all opponents in the benchmark matrix, and ranks it on the leaderboard.
Bench — Describe experiments in natural language (e.g. "Compare IPM and Scaling against Krum and Median"). The system parses the intent, plans the M×N experiment matrix, runs them sequentially, and reports results.
Leaderboard — Unified ranking of user submissions alongside baseline methods, with a "Compare in Matrix" feature that overlays any submission onto the baseline heatmap.
LLM Agent — OpenAI-compatible API integration. The agent generates attack/defense code from natural language descriptions, validates it via AST analysis, and triggers evaluation automatically.
CLI Mode — Everything also works via Claude Code skills (/fedarena_arena, /fedarena_bench) or direct Python module invocation, no web UI required.
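The AST validation step mentioned for the LLM Agent could look roughly like the sketch below. It uses only Python's standard `ast` module and is an illustration, not FedArena's actual validator: the function name `validate_submission`, the deny-list of imports, and the specific checks are all assumptions. Only the `method_name` attribute and the `attack(...)` method are taken from the submission example in this README.

```python
import ast

# Hypothetical deny-list; the real validator's rules are not documented here.
BANNED_IMPORTS = {"os", "subprocess", "socket", "shutil"}


def validate_submission(source: str, required_method: str = "attack") -> list[str]:
    """Return a list of problems found in submitted code (empty means it passed)."""
    try:
        tree = ast.parse(source)  # parses only; the submitted code is never executed
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems = []

    # Flag imports whose top-level module is on the deny-list.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] in BANNED_IMPORTS:
                problems.append(f"banned import: {name}")

    # Check the first top-level class for the expected attribute and method.
    classes = [n for n in tree.body if isinstance(n, ast.ClassDef)]
    if not classes:
        return problems + ["no class definition found"]
    cls = classes[0]
    assigned = {
        target.id
        for stmt in cls.body if isinstance(stmt, ast.Assign)
        for target in stmt.targets if isinstance(target, ast.Name)
    }
    methods = {stmt.name for stmt in cls.body if isinstance(stmt, ast.FunctionDef)}
    if "method_name" not in assigned:
        problems.append("class has no method_name attribute")
    if required_method not in methods:
        problems.append(f"class has no {required_method}() method")
    return problems


GOOD = '''
class MyAttack(ResearchAttackStrategy):
    method_name = "arena_attack_my_method"

    def attack(self, local_model_params, global_model_params, **kwargs):
        return local_model_params
'''
BAD = "import os\nclass Broken:\n    pass\n"

print(validate_submission(GOOD))
print(validate_submission(BAD))
```

Static checks like these catch malformed submissions early, but they do not make untrusted code safe to run; that is what the sandbox execution item on the roadmap addresses.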
Arena prompt: "Design an attack that adaptively scales poisoned updates based on the global model's gradient norm"
→ Agent generates code → AST validation → evaluates vs 7 defenses → ranked on leaderboard
Bench prompt: "Compare IPM and Scaling against Krum and Median"
→ Parses to 2×2 = 4 experiments → runs sequentially → results table
# Or submit code directly:
class MyAttack(ResearchAttackStrategy):
    method_name = "arena_attack_my_method"

    def attack(self, local_model_params, global_model_params, **kwargs):
        poisoned_params = local_model_params  # replace with your poisoning logic
        return poisoned_params

┌──────────────────────────────────────────────────────────────┐
│                    React + Vite Frontend                     │
│  (Dashboard · Arena · Bench · Leaderboard · Jobs · Detail)   │
└───────────────────────────┬──────────────────────────────────┘
                            │ REST + polling
┌───────────────────────────▼──────────────────────────────────┐
│                       FastAPI Backend                        │
│   ┌──────────────┐  ┌──────────────┐  ┌────────────────┐     │
│   │  LLM Agent   │  │  Submission  │  │  Bench Worker  │     │
│   │  (code gen)  │  │  Validator   │  │  (M×N runner)  │     │
│   └──────┬───────┘  └──────┬───────┘  └───────┬────────┘     │
│          │                 │                  │              │
│   ┌──────▼─────────────────▼──────────────────▼──────────┐   │
│   │               Arena Evaluation Engine                │   │
│   │        (registry · runner · matrix · ranking)        │   │
│   └──────────────────────────┬───────────────────────────┘   │
│                              │                               │
│   ┌──────────────┐    ┌──────▼───────┐    ┌──────────────┐   │
│   │    SQLite    │    │   fl_core    │    │  OpenAI API  │   │
│   │ (jobs, subs) │    │ (FL engine)  │    │ (LLM calls)  │   │
│   └──────────────┘    └──────────────┘    └──────────────┘   │
└──────────────────────────────────────────────────────────────┘
- News
- Key Features
- Architecture
- Setup
- Quick Start
- Benchmark Matrix
- Built-in Methods
- Project Structure
- Roadmap
git clone git@github.com:spire-studio/fedarena.git
cd fedarena
uv sync

For the LLM agent (prompt mode), create a .env file:
cp .env.example .env
# Edit .env and set:
# OPENAI_API_KEY=your-api-key
# OPENAI_API_BASE=https://api.openai.com/v1 (or any compatible endpoint)
# DEFAULT_LLM_MODEL=gpt-4o

Backend (terminal 1):
PYTHONPATH=libs:apps/backend/runners uv run uvicorn apps.backend.app.main:app \
  --host 0.0.0.0 --port 8000 --reload --reload-dir apps/backend/app

Frontend (terminal 2):
cd apps/frontend && pnpm install && pnpm dev --host 0.0.0.0

Access:
- Frontend: http://localhost:5173
- API docs: http://localhost:8000/docs
# Arena: evaluate a submission
PYTHONPATH=libs:apps/backend/runners uv run python -m fl_core.research.arena evaluate \
--method arena_attack_my_method --role attack \
--config configs/research/bench_baseline.yaml \
--matrix results/arena/benchmark_matrix.json
# Bench: run specific experiments
PYTHONPATH=libs:apps/backend/runners uv run python -m fl_core.research.runner \
--attack-method baseline_ipm --defense-method baseline_krum \
  --config configs/research/bench_baseline.yaml --seeds 0

Arena pre-computes every combination of baseline attacks × baseline defenses on a fixed FL configuration (CIFAR-10 non-IID, 10 clients, FedAvg).

| Attack / Defense | FedAvg | Krum | Median | TrimMean | Bulyan | CentClip | DnC |
|---|---|---|---|---|---|---|---|
| no_attack | 0.6180 | 0.4808 | 0.5470 | 0.6186 | 0.5389 | 0.6185 | 0.6012 |
| gaussian | 0.6289 | 0.4717 | 0.5620 | 0.6162 | 0.5477 | 0.6476 | 0.6172 |
| ipm | 0.6221 | 0.4739 | 0.5780 | 0.6092 | 0.5633 | 0.6229 | 0.6027 |
| scaling | 0.6247 | 0.4712 | 0.5738 | 0.6221 | 0.5442 | 0.6225 | 0.5957 |
| sign_flip | 0.6230 | 0.4676 | 0.5725 | 0.6103 | 0.5482 | 0.6098 | 0.6050 |
| alie | 0.6223 | 0.4565 | 0.5463 | 0.6118 | 0.5485 | 0.6060 | 0.5974 |

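One plausible way to turn the matrix above into a ranking is sketched here. FedArena's actual scoring rule is not described in this README, so treat this as an assumption-laden illustration: it assumes each cell is a final test accuracy, that a lower mean accuracy across defenses marks a stronger attack, and the helper names `rank_attacks` and `accuracy_drop` are hypothetical.

```python
from statistics import mean

DEFENSES = ["FedAvg", "Krum", "Median", "TrimMean", "Bulyan", "CentClip", "DnC"]

# Values copied from the benchmark matrix above (rows: attacks, columns: defenses).
MATRIX = {
    "no_attack": [0.6180, 0.4808, 0.5470, 0.6186, 0.5389, 0.6185, 0.6012],
    "gaussian":  [0.6289, 0.4717, 0.5620, 0.6162, 0.5477, 0.6476, 0.6172],
    "ipm":       [0.6221, 0.4739, 0.5780, 0.6092, 0.5633, 0.6229, 0.6027],
    "scaling":   [0.6247, 0.4712, 0.5738, 0.6221, 0.5442, 0.6225, 0.5957],
    "sign_flip": [0.6230, 0.4676, 0.5725, 0.6103, 0.5482, 0.6098, 0.6050],
    "alie":      [0.6223, 0.4565, 0.5463, 0.6118, 0.5485, 0.6060, 0.5974],
}


def rank_attacks(matrix):
    """Order attacks from strongest to weakest by mean accuracy across defenses.

    Assumption: cells are accuracies, so a lower mean means a stronger attack.
    """
    attacks = {name: row for name, row in matrix.items() if name != "no_attack"}
    return sorted(attacks, key=lambda name: mean(attacks[name]))


def accuracy_drop(matrix, attack):
    """Per-defense drop relative to the no_attack row (positive = accuracy fell)."""
    clean = matrix["no_attack"]
    return {
        defense: round(base - hit, 4)
        for defense, base, hit in zip(DEFENSES, clean, matrix[attack])
    }


print(rank_attacks(MATRIX))
print(accuracy_drop(MATRIX, "alie"))
```

With a single seed, as in the generate command in this README, differences of a few thousandths between cells may be within noise, so a ranking like this is best read as indicative.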
Generate or refresh:
PYTHONPATH=libs:apps/backend/runners uv run python -m fl_core.research.arena generate \
  --config configs/research/bench_baseline.yaml --seeds 0 --output results/arena

| Method | Type | Description |
|---|---|---|
| gaussian | Model poisoning | Gaussian noise injection |
| scaling | Model poisoning | Parameter scaling (Bagdasaryan et al., AISTATS '20) |
| ipm | Model poisoning | Inner-product manipulation (Xie et al., ICML '20) |
| sign_flip | Model poisoning | Sign flipping (Li et al., '19) |
| alie | Model poisoning | A Little Is Enough (Baruch et al., NeurIPS '19) |

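To make the simpler attacks in the table concrete, here is a minimal sketch of Gaussian noise, scaling, and sign flipping. These are simplified textbook forms, not the baselines shipped in `fl_core`, and the cited papers may differ in detail. Representing parameters as flat lists of floats, the function names, and the default `factor` and `std` values are all assumptions made for illustration.

```python
import random


def gaussian(local, std=1.0, seed=0):
    """Add zero-mean Gaussian noise to every parameter (std is a hypothetical default)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [value + rng.gauss(0.0, std) for value in local]


def scaling(local, global_, factor=10.0):
    """Amplify the client's update (local - global) by `factor`."""
    return [g + factor * (l - g) for l, g in zip(local, global_)]


def sign_flip(local, global_):
    """Send the negated update, pushing the global model the opposite way."""
    return [g - (l - g) for l, g in zip(local, global_)]


global_params = [1.0, 2.0]
local_params = [1.5, 1.0]   # honest update is [+0.5, -1.0]

print(scaling(local_params, global_params))
print(sign_flip(local_params, global_params))
print(gaussian(local_params, std=0.1))
```

In a real submission the same logic would live inside the `attack(self, local_model_params, global_model_params, **kwargs)` method and operate on whatever parameter structure the platform passes in.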
| Method | Description | Paper |
|---|---|---|
| krum | Distance-score selection | Blanchard et al., NeurIPS '17 |
| median | Coordinate-wise median | Yin et al., ICML '18 |
| trimmed_mean | Trimmed mean | Yin et al., ICML '18 |
| bulyan | Krum selection + coordinate clipping | Mhamdi et al., ICML '18 |
| centered_clipping | Momentum-based clipping | Karimireddy et al., ICML '21 |
| dnc | SVD-based anomaly detection | Shejwalkar & Houmansadr, NDSS '21 |

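Two of the defenses in the table, coordinate-wise median and trimmed mean, are simple enough to sketch in a few lines. This is a pure-Python illustration of the general technique, not the `fl_core` implementation: updates are assumed to be flat lists of floats, and the function names and the `trim` parameter are hypothetical.

```python
from statistics import median


def coordinate_median(updates):
    """Aggregate client updates by taking the median of each coordinate."""
    return [median(coord) for coord in zip(*updates)]


def trimmed_mean(updates, trim=1):
    """Drop the `trim` smallest and largest values per coordinate, then average."""
    if 2 * trim >= len(updates):
        raise ValueError("trim would discard every update")
    aggregated = []
    for coord in zip(*updates):
        kept = sorted(coord)[trim:len(coord) - trim]
        aggregated.append(sum(kept) / len(kept))
    return aggregated


honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.22], [0.11, -0.21]]
poisoned = honest + [[50.0, 50.0]]  # one malicious client sends a huge update

plain_mean = [sum(coord) / len(coord) for coord in zip(*poisoned)]
print("mean:        ", plain_mean)
print("median:      ", coordinate_median(poisoned))
print("trimmed mean:", trimmed_mean(poisoned, trim=1))
```

The plain mean is dragged far from the honest values by the single outlier, while both robust aggregators stay close to them. That robustness is bought by discarding information, which is one possible reason a defense can score below FedAvg even when no attack is present.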
fedarena/
├── apps/
│   ├── backend/
│   │   ├── app/                  # FastAPI application
│   │   │   ├── api/v1/           # REST endpoints (submissions, leaderboard, matrix, bench, agent)
│   │   │   ├── services/         # Business logic (evaluation worker, code validation, LLM agent)
│   │   │   ├── models.py         # SQLModel tables (Submission, EvaluationJob, BenchJob)
│   │   │   └── config.py         # Pydantic settings (.env loading)
│   │   └── runners/              # FL runtime (core_runtime.py)
│   └── frontend/                 # React + Vite + Tailwind + Radix UI
│       └── src/pages/            # Dashboard, Arena, Bench, Leaderboard, Jobs, Detail
├── libs/fl_core/                 # FL core library
│   ├── research/                 # Arena engine (registry, runner, arena, base classes)
│   │   ├── attacks/              # Baseline + user submissions
│   │   └── defenses/             # Baseline + user submissions
│   ├── federated/                # Server / Client / Aggregation
│   ├── models/                   # CNN / ResNet
│   ├── data/                     # Dataset loading & partitioning
│   ├── privacy/                  # CKKS encryption
│   └── compression/              # Top-K sparsification
├── configs/research/             # Experiment configs
├── results/arena/                # Benchmark matrix + evaluation results
└── .claude/skills/               # CLI skills (fedarena_arena, fedarena_bench)

- Arena evaluation pipeline (submit → validate → evaluate → rank)
- LLM Agent prompt mode with code review step
- Benchmark matrix generation & heatmap visualization
- Leaderboard (attack / defense ranking)
- Bench: natural-language experiment planning & execution
- CI pipeline (lint, type-check, 40+ backend tests)
- Incremental result saving & stale job recovery
- Task queue & concurrency control (replace bare threads)
- Training round logs viewable in detail page
- Auto-generated evaluation reports (Markdown / PDF export)
- Failure diagnostics — surface error reasons & failed experiment details in UI
- Scenario library — multiple datasets, non-IID levels, malicious client ratios, model architectures
- Multi-metric scoring — Accuracy Drop, Convergence Speed, Stability, Runtime Cost, Max Accuracy, per-opponent and summary aggregation
- Multi-dimensional leaderboard — Avg Accuracy, Accuracy Drop, Worst Case, Convergence, Stability columns with sort_by support
- Per-scenario leaderboards
- Method versioning — track iterations of the same attack/defense, side-by-side comparison
- Experiment comparison — cross-run overlay charts, side-by-side metrics
- Analytical reports — LLM-generated strengths/weaknesses, baseline comparison, recommendations, cached per evaluation
- Dashboard page — live arena status, active jobs, top methods, recent submissions
- Navigation restructure — Dashboard / Scenarios / Arena / Bench / Leaderboard / Reports / Methods / Jobs
- Arena UX improvements — left/right layout, template mode, evaluation intensity selection
- Matrix interaction — filter by scenario, click cell for run details & per-seed breakdown
- Sandbox execution — container isolation, timeout, network & filesystem restrictions for user-submitted code
- User & team accounts with permissions
- Challenge mode — fixed scenarios, time-limited competitions, hidden test sets
- Course mode — guided exercises for FL security education
- Resource quotas & scheduling (multi-GPU, multi-user)
- Audit logging
- Dataset & model plugin system for new FL scenarios
- Public leaderboards & embeddable widgets
FedArena is for research and educational use.