Production-grade GPU validation framework modeled on NVIDIA DCGM architecture. Runs multi-level hardware diagnostics, exports Prometheus metrics, and integrates with Grafana for real-time monitoring.
Built for data center reliability teams, ML infrastructure engineers, and GPU fleet operators.
GPU clusters fail silently — bad ECC memory, throttled clocks, degraded PCIe links — and broken hardware burns compute budget on garbage results. This test suite catches those failures before training starts. Modeled on NVIDIA's DCGM diagnostic framework, it runs 16 hardware validation modules, exports Prometheus metrics for fleet-wide monitoring, and generates JUnit XML reports that plug directly into CI. If you manage GPU infrastructure for ML workloads, this replaces ad-hoc nvidia-smi checks with systematic, repeatable diagnostics.
- Multi-level diagnostics — Quick (deployment checks), Medium (+ PCIe, memory, telemetry), Long (+ bandwidth, stress, topology), Extended (+ NCCL, burn-in)
- 16 diagnostic modules — Driver validation, GPU enumeration, PCIe gen/width/replay, VRAM allocation, pattern verification, XID errors, ECC health, clock throttling, compute stress, memory bandwidth, NVLink P2P, topology mapping, and more
- Prometheus metrics exporter — Real-time GPU telemetry on `:9835/metrics` via `prometheus_client`; compatible with any standard Prometheus scraper
- Live GPU health monitor — `monitor` command renders a continuously-refreshed Rich table (temperature, power, VRAM, clocks) at a configurable poll interval
- Run history — Every `diag` invocation appends a summary entry to `reports/.run_history.jsonl`; the `history` command displays it as a table with pass/fail filtering
- Fault injection — Five synthetic fault types (`thermal`, `ecc`, `pcie`, `clock`, `memory`) via `--inject-fault`; failure codes use the `DIAG-FI-*` prefix to distinguish injected faults from real diagnostic failures
- Docker Compose stack — One-command deployment with Prometheus + Grafana (auto-provisioned dashboards)
- Hardware profiles — Per-GPU threshold configs (RTX 5070 Ti, A100 80GB, H100 SXM included)
- Burn-in mode — Continuous stress testing with configurable duration (up to 24h)
- CI/CD integration — JUnit XML output, GitHub Actions pipeline, ruff linting
- Rich CLI — Colored terminal output with progress tables
```bash
# Install
pip install -e ".[dev]"
# GPU inventory
python -m src.main inventory
# Run diagnostics
python -m src.main diag --level quick # Deployment checks only (~1s)
python -m src.main diag --level medium # + PCIe, memory, telemetry (~5s)
python -m src.main diag --level long # + bandwidth, stress, topology (~30s)
python -m src.main diag --level extended # + NCCL, burn-in (~60s)
# Run a single named test
python -m src.main diag --test xid_errors
# Export results
python -m src.main diag --level long --output json
python -m src.main diag --level long --output junit --junit-file results.xml
# Burn-in mode (stress test for specified duration)
python -m src.main diag --mode burnin --duration 3600
# Pre-flight check before a training job
python -m src.main diag --level medium --mode preflight
# Inject a synthetic fault to verify failure handling
python -m src.main diag --level quick --inject-fault thermal
# Live GPU health monitor (Ctrl+C to stop)
python -m src.main monitor --interval 5
# View recent diagnostic run history
python -m src.main history
python -m src.main history --failures # Failed runs only
python -m src.main history --limit 50
# Start standalone Prometheus metrics server
python -m src.main metrics --port 9835
# GPU cleanup (reset clocks, power, CUDA context)
python -m src.main cleanup
```

Full observability stack with one command:

```bash
docker compose up -d
```

| Service | Port | Description |
|---|---|---|
| gpu-diag | 9835 | Prometheus metrics exporter |
| Prometheus | 9090 | Metrics storage and alerting |
| Grafana | 3000 | Dashboard visualization (admin/admin) |
Requires NVIDIA Container Toolkit.
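A quick way to confirm the stack came up is to poll each service's HTTP endpoint. The sketch below is stdlib-only and assumes the default ports from the table plus the standard Prometheus (`/-/ready`) and Grafana (`/api/health`) readiness endpoints.

```python
# stack_check.py -- illustrative sketch: verify the compose stack's endpoints respond.
from urllib.error import URLError
from urllib.request import urlopen

ENDPOINTS = {
    "gpu-diag metrics": "http://localhost:9835/metrics",
    "gpu-diag health": "http://localhost:9835/health",
    "Prometheus": "http://localhost:9090/-/ready",
    "Grafana": "http://localhost:3000/api/health",
}

for name, url in ENDPOINTS.items():
    try:
        status = urlopen(url, timeout=5).status
        print(f"{name:<18} {url:<40} HTTP {status}")
    except URLError as exc:
        print(f"{name:<18} {url:<40} UNREACHABLE ({exc.reason})")
```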
```mermaid
graph TD
CLI["main.py<br/><i>Click CLI</i>"]
CLI --> INV["inventory/<br/>GPU Discovery"]
CLI --> DIAG["diagnostics/<br/>16 Modules"]
CLI --> MON["monitoring/<br/>Live Health Monitor"]
CLI --> RPT["reporting/"]
CLI --> FI["fault_injection/<br/>5 Fault Types"]
subgraph Diagnostics
DIAG --> DEP["deployment<br/><small>Driver, GPU count, ECC</small>"]
DIAG --> HEALTH["gpu_health<br/><small>Temp, power, VRAM, clocks</small>"]
DIAG --> PCIE["pcie_validation<br/><small>Gen, width, replay counters</small>"]
DIAG --> PCIEBW["pcie_bandwidth<br/><small>H2D / D2H throughput</small>"]
DIAG --> MEM["memory_test<br/><small>VRAM alloc + pattern verify</small>"]
DIAG --> MEMBW["memory_bandwidth<br/><small>HBM bandwidth</small>"]
DIAG --> COMP["compute_stress<br/><small>SM occupancy stress</small>"]
DIAG --> SM["sm_stress<br/><small>SM saturation</small>"]
DIAG --> PWR["power_test<br/><small>Power draw under load</small>"]
DIAG --> ECC["ecc_health<br/><small>SBE/DBE, row remapping</small>"]
DIAG --> XID["xid_errors<br/><small>XID event log analysis</small>"]
DIAG --> CLK["clock_throttle<br/><small>Throttle reason detection</small>"]
DIAG --> NVL["nvlink_p2p<br/><small>NVLink P2P validation</small>"]
DIAG --> NCCL["nccl_validation<br/><small>Collective ops (simulated)</small>"]
DIAG --> TOPO["topology_map<br/><small>PCIe/NVLink topology</small>"]
DIAG --> CLEAN["gpu_cleanup<br/><small>Post-test GPU reset</small>"]
end
subgraph Reporting
RPT --> RUNNER["test_runner<br/><small>Test orchestration</small>"]
RPT --> PROM["prometheus<br/><small>:9835/metrics</small>"]
RPT --> JUNIT["junit_xml<br/><small>CI/CD reports</small>"]
RPT --> HIST["history<br/><small>JSONL run history</small>"]
RPT --> MODELS["models<br/><small>TestResult, DiagnosticRun</small>"]
end
subgraph Observability ["Docker Compose Stack"]
PROM --> PROMETHEUS["Prometheus<br/><small>:9090</small>"]
PROMETHEUS --> GRAFANA["Grafana<br/><small>:3000</small>"]
end
DB["database/<br/><small>SQLAlchemy persistence<br/>(planned)</small>"]
RPT --> DB
style CLI fill:#4a90d9,color:#fff
style DIAG fill:#76b900,color:#fff
style RPT fill:#e6522c,color:#fff
style FI fill:#d94a4a,color:#fff
style PROMETHEUS fill:#e6522c,color:#fff
style GRAFANA fill:#f2a900,color:#fff
```
| Level | Tests | Duration | Use Case |
|---|---|---|---|
| quick | 1 | ~1s | Smoke test after provisioning |
| medium | 7 | ~5s | Pre-job validation |
| long | 14 | ~30s | Scheduled health checks |
| extended | 15 | ~60s | Full qualification / burn-in |
Level contents:
| Test Module | quick | medium | long | extended |
|---|---|---|---|---|
| deployment | ✓ | ✓ | ✓ | ✓ |
| gpu_health | | ✓ | ✓ | ✓ |
| pcie_validation | | ✓ | ✓ | ✓ |
| memory_test | | ✓ | ✓ | ✓ |
| xid_errors | | ✓ | ✓ | ✓ |
| clock_throttle | | ✓ | ✓ | ✓ |
| ecc_health | | ✓ | ✓ | ✓ |
| topology_map | | | ✓ | ✓ |
| pcie_bandwidth | | | ✓ | ✓ |
| memory_bandwidth | | | ✓ | ✓ |
| compute_stress | | | ✓ | ✓ |
| sm_stress | | | ✓ | ✓ |
| power_test | | | ✓ | ✓ |
| nvlink_p2p | | | ✓ | ✓ |
| nccl_validation | | | | ✓ |
| Mode | Description | Stress Duration |
|---|---|---|
| standard | Normal test execution (default) | Profile default |
| preflight | Pre-job health check, shorter stress | 30s |
| burnin | New hardware qualification | Configurable |
```bash
# Pre-flight check before a training job
python -m src.main diag --level medium --mode preflight

# 8-hour burn-in for new hardware
python -m src.main diag --level extended --mode burnin --duration 28800
```

The `--inject-fault` flag appends a synthetic FAIL result to the diagnostic run, allowing you to verify that alerting pipelines, JUnit reporting, and CI failure gates respond correctly without requiring actual hardware faults.
```bash
python -m src.main diag --level quick --inject-fault thermal
python -m src.main diag --level quick --inject-fault ecc
python -m src.main diag --level quick --inject-fault pcie
python -m src.main diag --level quick --inject-fault clock
python -m src.main diag --level quick --inject-fault memory
```

Injected results carry `DIAG-FI-*` failure codes, which are distinct from real diagnostic codes and can be filtered in alert rules.
| Fault | Simulated Condition | Failure Code |
|---|---|---|
| thermal | GPU temperature 95°C, exceeding 85°C threshold | DIAG-FI-300 |
| ecc | Double-bit ECC error (DBE count = 1) | DIAG-FI-401 |
| pcie | PCIe link degraded to Gen4 x8 (expected x16) | DIAG-FI-202 |
| clock | Clock throttle: SW_THERMAL_SLOWDOWN active | DIAG-FI-501 |
| memory | VRAM stress failure at iteration 512 | DIAG-FI-102 |
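A CI gate can use this distinction to fail only on real hardware faults. Below is a minimal sketch that parses the JUnit XML produced with `--output junit` and ignores injected results; the assumption that the `DIAG-*` code appears in each `<failure>` element's message is ours, so adapt it to the actual report contents.

```python
# ci_gate.py -- illustrative sketch: fail CI only on real (non-injected) diagnostic failures.
# Assumes the DIAG-* failure code appears in the <failure> message or text of each testcase.
import sys
import xml.etree.ElementTree as ET

def real_failures(junit_path: str) -> list[str]:
    tree = ET.parse(junit_path)
    failed = []
    for case in tree.iter("testcase"):
        for failure in case.findall("failure"):
            text = (failure.get("message") or "") + (failure.text or "")
            if "DIAG-FI-" in text:          # injected fault: ignore for gating
                continue
            failed.append(case.get("name", "unknown"))
    return failed

if __name__ == "__main__":
    bad = real_failures(sys.argv[1] if len(sys.argv) > 1 else "results.xml")
    if bad:
        print("Real diagnostic failures:", ", ".join(bad))
        sys.exit(1)
    print("No real diagnostic failures (injected faults ignored).")
```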
Real diagnostic failures use numeric codes to identify the specific check that failed:
| Code | Check | Module |
|---|---|---|
| DIAG-001 | Driver not loaded | deployment |
| DIAG-002 | GPU count mismatch | deployment |
| DIAG-003 | GPU model mismatch | deployment |
| DIAG-004 | ECC mode mismatch | deployment |
| DIAG-100 | VRAM allocation failure | memory_test |
| DIAG-200 | PCIe link degradation | pcie_validation |
| DIAG-300 | Temperature threshold breach | gpu_health |
| DIAG-400 | ECC error (SBE) | ecc_health |
| DIAG-401 | ECC error (DBE) | ecc_health |
| DIAG-500 | Clock throttle detected | clock_throttle |
| DIAG-600 | Compute stress failure | compute_stress |
| DIAG-FI-* | Injected fault (test only) | fault_injection |
Every diag run appends a one-line JSON entry to reports/.run_history.jsonl. The history command reads this file and renders a summary table:
```
$ python -m src.main history
Diagnostic Run History
┌────────────────────┬──────────┬────────┬────────┬───────┬────────┬────────┬──────────┐
│ Timestamp │ Run ID │ Level │ Status │ Tests │ Failed │ Warned │ Duration │
├────────────────────┼──────────┼────────┼────────┼───────┼────────┼────────┼──────────┤
│ 2026-03-24T09:14:02│ a3f1b2c4 │ medium │ PASS │ 7 │ 0 │ 0 │ 4.8s │
│ 2026-03-24T08:55:17│ 91d7e6a0 │ quick │ FAIL │ 6 │ 1 │ 0 │ 0.9s │
└────────────────────┴──────────┴────────┴────────┴───────┴────────┴────────┴──────────┘
```
Entries are stored in reverse-chronological order. Use --failures to filter to failed runs only, and --limit N to control how many entries are shown (default: 20).
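To consume the history programmatically (for example, in a nightly fleet report), the JSONL file can be read line by line. This is a minimal sketch, not the project's own loader; the `status` field name and the FAIL value are assumptions based on the table columns above.

```python
# history_summary.py -- illustrative sketch: summarize reports/.run_history.jsonl.
# Field names are assumed from the history table columns; adjust to the actual schema.
import json
from pathlib import Path

def load_history(path: str = "reports/.run_history.jsonl") -> list[dict]:
    entries = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate malformed lines, as the history command does
    return entries

if __name__ == "__main__":
    runs = load_history()
    failed = [r for r in runs if str(r.get("status", "")).upper() == "FAIL"]
    print(f"{len(runs)} runs recorded, {len(failed)} failed")
```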
Exported at http://localhost:9835/metrics using prometheus_client (standard Prometheus exposition format):
```
# HELP gpu_temperature_celsius Current GPU temperature
# TYPE gpu_temperature_celsius gauge
gpu_temperature_celsius{gpu="0"} 47.0
# HELP gpu_power_draw_watts Current GPU power draw
# TYPE gpu_power_draw_watts gauge
gpu_power_draw_watts{gpu="0"} 30.9
# HELP gpu_memory_used_mib GPU VRAM usage in MiB
# TYPE gpu_memory_used_mib gauge
gpu_memory_used_mib{gpu="0"} 2054.0
# HELP gpu_diagnostic_status Diagnostic test status (1=pass, 0=fail, 2=warn, 3=skip)
# TYPE gpu_diagnostic_status gauge
gpu_diagnostic_status{test="deployment.driver_loaded",gpu_uuid=""} 1.0
# HELP gpu_diagnostic_run_total Total diagnostic runs
# TYPE gpu_diagnostic_run_total counter
gpu_diagnostic_run_total 3.0
```
Status values: 1 = PASS, 0 = FAIL, 2 = WARN, 3 = SKIP
The /health endpoint returns {"status": "ok"} for load balancer health checks.
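To sanity-check the exporter without a full Prometheus deployment, you can scrape and parse the endpoint directly. The sketch below uses the stdlib plus `prometheus_client`'s text parser and assumes the exporter is already running locally on the default port.

```python
# scrape_check.py -- illustrative sketch: pull and parse the exporter output locally.
# Assumes `python -m src.main metrics --port 9835` (or the compose stack) is running.
from urllib.request import urlopen

from prometheus_client.parser import text_string_to_metric_families

def read_gpu_temperatures(url: str = "http://localhost:9835/metrics") -> dict[str, float]:
    text = urlopen(url, timeout=5).read().decode("utf-8")
    temps = {}
    for family in text_string_to_metric_families(text):
        if family.name == "gpu_temperature_celsius":
            for sample in family.samples:
                temps[sample.labels.get("gpu", "?")] = sample.value
    return temps

if __name__ == "__main__":
    for gpu, temp in read_gpu_temperatures().items():
        print(f"GPU {gpu}: {temp:.1f} °C")
```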
Pre-configured Prometheus alerts in config/prometheus/alerts.yml:
| Alert | Condition | Severity |
|---|---|---|
| GPUTemperatureCritical | > 85°C for 2m | critical |
| GPUTemperatureWarning | > 75°C for 5m | warning |
| GPUDiagnosticFailed | Any test fails | critical |
| GPUPowerExcessive | > 290W for 2m | warning |
| GPUECCDoublebitError | DBE count > 0 | critical |
| GPUECCSinglebitRising | SBE rate > 0.1/hr | warning |
GPU-specific thresholds in config/profiles/:
```yaml
# config/profiles/rtx_5070ti.yaml
gpu_model: "NVIDIA GeForce RTX 5070 Ti"
gpu_count: 1
pcie_gen_expected: 4
pcie_width_expected: 16
temp_warning_c: 80
temp_critical_c: 89
power_limit_w: 300
```

Included profiles: RTX 5070 Ti, A100 80GB SXM, H100 SXM.
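As an illustration of how a profile might be applied, the sketch below loads the YAML with `pyyaml` and compares a live reading from `pynvml` against the profile's temperature thresholds. It is not the project's own loader, just a minimal example under those assumptions.

```python
# profile_check.py -- illustrative sketch: compare a live temperature against profile thresholds.
import pynvml
import yaml

def check_temperature(profile_path: str = "config/profiles/rtx_5070ti.yaml") -> None:
    with open(profile_path) as f:
        profile = yaml.safe_load(f)

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        if temp >= profile["temp_critical_c"]:
            print(f"CRITICAL: {temp} °C >= {profile['temp_critical_c']} °C")
        elif temp >= profile["temp_warning_c"]:
            print(f"WARNING: {temp} °C >= {profile['temp_warning_c']} °C")
        else:
            print(f"OK: {temp} °C (warning at {profile['temp_warning_c']} °C)")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    check_temperature()
```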
```bash
pytest tests/              # 208 tests, 0 warnings
ruff check src/ tests/     # Lint (all checks pass)
```

| Test File | Items | Coverage |
|---|---|---|
| `test_production.py` | 23 | Prometheus metrics exposition format, /metrics and /health HTTP endpoints |
| `test_telemetry.py` | 22 | XID error detection, clock throttle reason classification, ECC error counters |
| `test_pcie_validation_diag.py` | 14 | PCIe gen/width/replay detection, degradation summary |
| `test_history.py` | 14 | JSONL run persistence, newest-first ordering, `--failures` filter, malformed line handling |
| `test_fault_injection.py` | 30 | 5 fault types × parametrize — FAIL status, DIAG-FI-* prefix, injected flag, no collision with real codes |
| `test_stress.py` | 12 | Compute stress, SM saturation |
| `test_runner.py` | 12 | Test orchestration, result aggregation |
| `test_run_levels.py` | 12 | quick / medium / long / extended level configuration |
| `test_interconnect.py` | 12 | NVLink P2P, topology mapping |
| `test_gpu_health.py` | 12 | Temperature, power, VRAM availability, clock responsiveness |
| `test_deployment.py` | 12 | Driver load, GPU count/model/ECC, unique DIAG-001–004 codes, nvml session lifecycle |
| `test_bandwidth.py` | 12 | Host-to-device, device-to-host, and memory triad bandwidth |
| `test_memory_test.py` | 10 | VRAM allocation, pattern verification |
| `test_cli.py` | 7 | `monitor` KeyboardInterrupt handling, `history` rendering and flag filters |
| `test_pcie_validation.py` | 4 | PCIe link speed/width pass/fail cases |
| **Total** | **208** | |
PytestCollectionWarning suppression is configured in pyproject.toml.
GitHub Actions runs on every push/PR to master:
- Ruff linting
- Full test suite on Python 3.11 and 3.13
- JUnit XML artifact upload
| Package | Purpose |
|---|---|
| `nvidia-ml-py` | pynvml bindings for GPU hardware access |
| `psutil` | CPU, RAM, and OS system info |
| `click` | CLI framework |
| `pyyaml` | Hardware profile configuration |
| `rich` | Terminal output tables and live display |
| `torch` | Compute stress and memory bandwidth tests |
| `prometheus-client` | Prometheus exposition format exporter |
| `sqlalchemy` | Database layer (placeholder, not active) |
| `psycopg2-binary` | PostgreSQL driver (placeholder) |
- Python 3.11+
- NVIDIA GPU with driver installed
- Docker + NVIDIA Container Toolkit (for containerized deployment)
Windows (WDDM): On Windows, nvmlDeviceGetComputeRunningProcesses returns all processes with a GPU context — including the desktop compositor, browsers, and system UI — not just CUDA workloads. The gpu_processes check filters to processes consuming >100 MB VRAM to correctly distinguish compute workloads from display processes. On Linux servers this filter has no effect.
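For illustration, a filter of that kind can be written directly against `pynvml`. The sketch below applies the same >100 MB VRAM cutoff described above; it is a standalone example, not the project's `gpu_processes` implementation.

```python
# compute_process_filter.py -- illustrative sketch: list GPU processes that look like real
# compute workloads by dropping small (<100 MiB) contexts such as the desktop compositor
# that WDDM reports alongside CUDA workloads.
import pynvml

VRAM_CUTOFF_BYTES = 100 * 1024 * 1024  # ~100 MiB cutoff, as described above

def compute_processes(gpu_index: int = 0) -> list[tuple[int, int]]:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        # usedGpuMemory can be None when the driver cannot report it
        return [
            (p.pid, p.usedGpuMemory)
            for p in procs
            if p.usedGpuMemory and p.usedGpuMemory > VRAM_CUTOFF_BYTES
        ]
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    for pid, mem in compute_processes():
        print(f"PID {pid}: {mem / 2**20:.0f} MiB")
```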
ECC: Consumer GPUs (GeForce series) do not support ECC. deployment.ecc_mode and telemetry.ecc_health report SKIP on these devices — this is expected behavior, not a fault.
Clock throttle: App-clock-limiting (clocks_event_reason_applications_clocks_setting) on consumer GPUs reflects user-configured application clock caps, not a hardware fault. This state is classified as PASS.
NCCL validation (simulated): nccl_validation.py runs an in-process simulation of AllReduce and AllGather — it measures PCIe P2P bandwidth between GPUs but does not initialize torch.distributed or invoke the NCCL library. True NCCL collective op benchmarking requires multi-process execution (torchrun/mpirun) on a multi-GPU node. The current test validates computation correctness and raw P2P throughput; full NCCL integration is a planned enhancement.
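For reference, a true multi-process NCCL collective check (the planned enhancement) would look roughly like the generic `torch.distributed` sketch below, launched with `torchrun --nproc_per_node=<num_gpus>`. This is not code from this repository.

```python
# nccl_allreduce_check.py -- illustrative sketch of a real NCCL all-reduce validation.
# Launch with: torchrun --nproc_per_node=<num_gpus> nccl_allreduce_check.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world = dist.get_world_size()

    # Each rank contributes a tensor filled with its rank; the all-reduce sum
    # must equal 0 + 1 + ... + (world - 1) on every rank.
    x = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    expected = world * (world - 1) / 2
    assert torch.allclose(x, torch.full_like(x, expected)), "NCCL all-reduce mismatch"

    if rank == 0:
        print(f"all_reduce OK across {world} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```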
Database persistence: The database/ directory and sqlalchemy/psycopg2-binary dependencies are included for a planned SQLAlchemy persistence layer. File-based run history is available now via reports/.run_history.jsonl.
The metrics exporter uses prometheus_client (push model via /metrics). OpenTelemetry (OTEL) exporter support is planned — this will enable traces, metrics, and logs via a single OTEL collector rather than a dedicated Prometheus scraper. The Prometheus endpoint will remain for backwards compatibility.
| Enhancement | Status |
|---|---|
| OpenTelemetry exporter (`opentelemetry-sdk`) | Planned |
| ExLlamaV2 inference benchmark module | Planned |
| TensorRT-LLM throughput benchmark (fp8 vs bf16, sm_120) | Planned |
| NCCL true multi-process collective validation (`torchrun`) | Planned |
| SQLAlchemy result persistence (`database/` layer) | Planned |
MIT