Production-grade GPU validation framework modeled on NVIDIA DCGM architecture. Runs multi-level hardware diagnostics, exports Prometheus metrics, and integrates with Grafana for real-time monitoring.
Built for data center reliability teams, ML infrastructure engineers, and GPU fleet operators.
GPU clusters fail silently — bad ECC memory, throttled clocks, degraded PCIe links — and broken hardware burns compute budget on garbage results. This test suite catches those failures before training starts. Modeled on NVIDIA's DCGM diagnostic framework, it runs 16 hardware validation modules, exports Prometheus metrics for fleet-wide monitoring, and generates JUnit XML reports that plug directly into CI. If you manage GPU infrastructure for ML workloads, this replaces ad-hoc nvidia-smi checks with systematic, repeatable diagnostics.
- Multi-level diagnostics — Quick (deployment checks), Medium (+ PCIe, memory, telemetry), Long (+ bandwidth, stress, topology), Extended (+ NCCL, burn-in)
- 16 diagnostic modules — Driver validation, GPU enumeration, PCIe gen/width/replay, VRAM allocation, pattern verification, XID errors, ECC health, clock throttling, compute stress, memory bandwidth, NVLink P2P, topology mapping, and more
- Prometheus metrics exporter — Real-time GPU telemetry on `:9835/metrics` via `prometheus_client`; compatible with any standard Prometheus scraper
- Live GPU health monitor — `monitor` command renders a continuously-refreshed Rich table (temperature, power, VRAM, clocks) at a configurable poll interval
- Run history — Every `diag` invocation appends a summary entry to `reports/.run_history.jsonl`; the `history` command displays it as a table with pass/fail filtering
- Fault injection — Five synthetic fault types (`thermal`, `ecc`, `pcie`, `clock`, `memory`) via `--inject-fault`; failure codes use the `DIAG-FI-*` prefix to distinguish injected faults from real diagnostic failures
- Docker Compose stack — One-command deployment with Prometheus + Grafana (auto-provisioned dashboards)
- Hardware profiles — Per-GPU threshold configs (RTX 5070 Ti, A100 80GB, H100 SXM included)
- Burn-in mode — Continuous stress testing with configurable duration (up to 24h)
- CI/CD integration — JUnit XML output, GitHub Actions pipeline, ruff linting
- Rich CLI — Colored terminal output with progress tables
```bash
# Install
pip install -e ".[dev]"
# GPU inventory
python -m src.main inventory
# Run diagnostics
python -m src.main diag --level quick # Deployment checks only (~1s)
python -m src.main diag --level medium # + PCIe, memory, telemetry (~5s)
python -m src.main diag --level long # + bandwidth, stress, topology (~30s)
python -m src.main diag --level extended # + NCCL, burn-in (~60s)
# Run a single named test
python -m src.main diag --test xid_errors
# Export results
python -m src.main diag --level long --output json
python -m src.main diag --level long --output junit --junit-file results.xml
# Burn-in mode (stress test for specified duration)
python -m src.main diag --mode burnin --duration 3600
# Pre-flight check before a training job
python -m src.main diag --level medium --mode preflight
# Inject a synthetic fault to verify failure handling
python -m src.main diag --level quick --inject-fault thermal
# Live GPU health monitor (Ctrl+C to stop)
python -m src.main monitor --interval 5
# View recent diagnostic run history
python -m src.main history
python -m src.main history --failures # Failed runs only
python -m src.main history --limit 50
# Start standalone Prometheus metrics server
python -m src.main metrics --port 9835
# GPU cleanup (reset clocks, power, CUDA context)
python -m src.main cleanup
```

Full observability stack with one command:

```bash
docker compose up -d
```

| Service | Port | Description |
|---|---|---|
| gpu-diag | 9835 | Prometheus metrics exporter |
| Prometheus | 9090 | Metrics storage and alerting |
| Grafana | 3000 | Dashboard visualization (admin/admin) |
Requires NVIDIA Container Toolkit.
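A quick way to confirm the stack came up is to poll each service's HTTP endpoint. The sketch below is stdlib-only and assumes the default ports from the table plus the standard Prometheus (`/-/ready`) and Grafana (`/api/health`) readiness endpoints.

```python
# stack_check.py -- illustrative sketch: verify the compose stack's endpoints respond.
from urllib.error import URLError
from urllib.request import urlopen

ENDPOINTS = {
    "gpu-diag metrics": "http://localhost:9835/metrics",
    "gpu-diag health": "http://localhost:9835/health",
    "Prometheus": "http://localhost:9090/-/ready",
    "Grafana": "http://localhost:3000/api/health",
}

for name, url in ENDPOINTS.items():
    try:
        status = urlopen(url, timeout=5).status
        print(f"{name:<18} {url:<40} HTTP {status}")
    except URLError as exc:
        print(f"{name:<18} {url:<40} UNREACHABLE ({exc.reason})")
```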
```mermaid
graph TD
CLI["main.py<br/><i>Click CLI</i>"]
CLI --> INV["inventory/<br/>GPU Discovery"]
CLI --> DIAG["diagnostics/<br/>16 Modules"]
CLI --> MON["monitoring/<br/>Live Health Monitor"]
CLI --> RPT["reporting/"]
CLI --> FI["fault_injection/<br/>5 Fault Types"]
subgraph Diagnostics
DIAG --> DEP["deployment<br/><small>Driver, GPU count, ECC</small>"]
DIAG --> HEALTH["gpu_health<br/><small>Temp, power, VRAM, clocks</small>"]
DIAG --> PCIE["pcie_validation<br/><small>Gen, width, replay counters</small>"]
DIAG --> PCIEBW["pcie_bandwidth<br/><small>H2D / D2H throughput</small>"]
DIAG --> MEM["memory_test<br/><small>VRAM alloc + pattern verify</small>"]
DIAG --> MEMBW["memory_bandwidth<br/><small>HBM bandwidth</small>"]
DIAG --> COMP["compute_stress<br/><small>SM occupancy stress</small>"]
DIAG --> SM["sm_stress<br/><small>SM saturation</small>"]
DIAG --> PWR["power_test<br/><small>Power draw under load</small>"]
DIAG --> ECC["ecc_health<br/><small>SBE/DBE, row remapping</small>"]
DIAG --> XID["xid_errors<br/><small>XID event log analysis</small>"]
DIAG --> CLK["clock_throttle<br/><small>Throttle reason detection</small>"]
DIAG --> NVL["nvlink_p2p<br/><small>NVLink P2P validation</small>"]
DIAG --> NCCL["nccl_validation<br/><small>Collective ops (simulated)</small>"]
DIAG --> TOPO["topology_map<br/><small>PCIe/NVLink topology</small>"]
DIAG --> CLEAN["gpu_cleanup<br/><small>Post-test GPU reset</small>"]
end
subgraph Reporting
RPT --> RUNNER["test_runner<br/><small>Test orchestration</small>"]
RPT --> PROM["prometheus<br/><small>:9835/metrics</small>"]
RPT --> JUNIT["junit_xml<br/><small>CI/CD reports</small>"]
RPT --> HIST["history<br/><small>JSONL run history</small>"]
RPT --> MODELS["models<br/><small>TestResult, DiagnosticRun</small>"]
end
subgraph Observability ["Docker Compose Stack"]
PROM --> PROMETHEUS["Prometheus<br/><small>:9090</small>"]
PROMETHEUS --> GRAFANA["Grafana<br/><small>:3000</small>"]
end
DB["database/<br/><small>SQLAlchemy persistence<br/>(planned)</small>"]
RPT --> DB
style CLI fill:#4a90d9,color:#fff
style DIAG fill:#76b900,color:#fff
style RPT fill:#e6522c,color:#fff
style FI fill:#d94a4a,color:#fff
style PROMETHEUS fill:#e6522c,color:#fff
style GRAFANA fill:#f2a900,color:#fff
```
| Level | Tests | Duration | Use Case |
|---|---|---|---|
| quick | 1 | ~1s | Smoke test after provisioning |
| medium | 7 | ~5s | Pre-job validation |
| long | 14 | ~30s | Scheduled health checks |
| extended | 15 | ~60s | Full qualification / burn-in |
Level contents:
| Test Module | quick | medium | long | extended |
|---|---|---|---|---|
| deployment | ✓ | ✓ | ✓ | ✓ |
| gpu_health | | ✓ | ✓ | ✓ |
| pcie_validation | | ✓ | ✓ | ✓ |
| memory_test | | ✓ | ✓ | ✓ |
| xid_errors | | ✓ | ✓ | ✓ |
| clock_throttle | | ✓ | ✓ | ✓ |
| ecc_health | | ✓ | ✓ | ✓ |
| topology_map | | | ✓ | ✓ |
| pcie_bandwidth | | | ✓ | ✓ |
| memory_bandwidth | | | ✓ | ✓ |
| compute_stress | | | ✓ | ✓ |
| sm_stress | | | ✓ | ✓ |
| power_test | | | ✓ | ✓ |
| nvlink_p2p | | | ✓ | ✓ |
| nccl_validation | | | | ✓ |
| Mode | Description | Stress Duration |
|---|---|---|
| standard | Normal test execution (default) | Profile default |
| preflight | Pre-job health check, shorter stress | 30s |
| burnin | New hardware qualification | Configurable |
```bash
# Pre-flight check before a training job
python -m src.main diag --level medium --mode preflight

# 8-hour burn-in for new hardware
python -m src.main diag --level extended --mode burnin --duration 28800
```

The `--inject-fault` flag appends a synthetic FAIL result to the diagnostic run, allowing you to verify that alerting pipelines, JUnit reporting, and CI failure gates respond correctly without requiring actual hardware faults.
```bash
python -m src.main diag --level quick --inject-fault thermal
python -m src.main diag --level quick --inject-fault ecc
python -m src.main diag --level quick --inject-fault pcie
python -m src.main diag --level quick --inject-fault clock
python -m src.main diag --level quick --inject-fault memory
```

Injected results carry `DIAG-FI-*` failure codes, which are distinct from real diagnostic codes and can be filtered in alert rules.
| Fault | Simulated Condition | Failure Code |
|---|---|---|
| thermal | GPU temperature 95°C, exceeding 85°C threshold | DIAG-FI-300 |
| ecc | Double-bit ECC error (DBE count = 1) | DIAG-FI-401 |
| pcie | PCIe link degraded to Gen4 x8 (expected x16) | DIAG-FI-202 |
| clock | Clock throttle: SW_THERMAL_SLOWDOWN active | DIAG-FI-501 |
| memory | VRAM stress failure at iteration 512 | DIAG-FI-102 |
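A CI gate can use this distinction to fail only on real hardware faults. Below is a minimal sketch that parses the JUnit XML produced with `--output junit` and ignores injected results; the assumption that the `DIAG-*` code appears in each `<failure>` element's message is ours, so adapt it to the actual report contents.

```python
# ci_gate.py -- illustrative sketch: fail CI only on real (non-injected) diagnostic failures.
# Assumes the DIAG-* failure code appears in the <failure> message or text of each testcase.
import sys
import xml.etree.ElementTree as ET

def real_failures(junit_path: str) -> list[str]:
    tree = ET.parse(junit_path)
    failed = []
    for case in tree.iter("testcase"):
        for failure in case.findall("failure"):
            text = (failure.get("message") or "") + (failure.text or "")
            if "DIAG-FI-" in text:          # injected fault: ignore for gating
                continue
            failed.append(case.get("name", "unknown"))
    return failed

if __name__ == "__main__":
    bad = real_failures(sys.argv[1] if len(sys.argv) > 1 else "results.xml")
    if bad:
        print("Real diagnostic failures:", ", ".join(bad))
        sys.exit(1)
    print("No real diagnostic failures (injected faults ignored).")
```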
Real diagnostic failures use numeric codes to identify the specific check that failed:
| Code | Check | Module |
|---|---|---|
| DIAG-001 | Driver not loaded | deployment |
| DIAG-002 | GPU count mismatch | deployment |
| DIAG-003 | GPU model mismatch | deployment |
| DIAG-004 | ECC mode mismatch | deployment |
| DIAG-100 | VRAM allocation failure | memory_test |
| DIAG-200 | PCIe link degradation | pcie_validation |
| DIAG-300 | Temperature threshold breach | gpu_health |
| DIAG-400 | ECC error (SBE) | ecc_health |
| DIAG-401 | ECC error (DBE) | ecc_health |
| DIAG-500 | Clock throttle detected | clock_throttle |
| DIAG-600 | Compute stress failure | compute_stress |
| DIAG-FI-* | Injected fault (test only) | fault_injection |
Every diag run appends a one-line JSON entry to reports/.run_history.jsonl. The history command reads this file and renders a summary table:
```
$ python -m src.main history
Diagnostic Run History
┌────────────────────┬──────────┬────────┬────────┬───────┬────────┬────────┬──────────┐
│ Timestamp │ Run ID │ Level │ Status │ Tests │ Failed │ Warned │ Duration │
├────────────────────┼──────────┼────────┼────────┼───────┼────────┼────────┼──────────┤
│ 2026-03-24T09:14:02│ a3f1b2c4 │ medium │ PASS │ 7 │ 0 │ 0 │ 4.8s │
│ 2026-03-24T08:55:17│ 91d7e6a0 │ quick │ FAIL │ 6 │ 1 │ 0 │ 0.9s │
└────────────────────┴──────────┴────────┴────────┴───────┴────────┴────────┴──────────┘
```
Entries are stored in reverse-chronological order. Use --failures to filter to failed runs only, and --limit N to control how many entries are shown (default: 20).
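To consume the history programmatically (for example, in a nightly fleet report), the JSONL file can be read line by line. This is a minimal sketch, not the project's own loader; the `status` field name and the FAIL value are assumptions based on the table columns above.

```python
# history_summary.py -- illustrative sketch: summarize reports/.run_history.jsonl.
# Field names are assumed from the history table columns; adjust to the actual schema.
import json
from pathlib import Path

def load_history(path: str = "reports/.run_history.jsonl") -> list[dict]:
    entries = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate malformed lines, as the history command does
    return entries

if __name__ == "__main__":
    runs = load_history()
    failed = [r for r in runs if str(r.get("status", "")).upper() == "FAIL"]
    print(f"{len(runs)} runs recorded, {len(failed)} failed")
```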
Exported at http://localhost:9835/metrics using prometheus_client (standard Prometheus exposition format):
```
# HELP gpu_temperature_celsius Current GPU temperature
# TYPE gpu_temperature_celsius gauge
gpu_temperature_celsius{gpu="0"} 47.0
# HELP gpu_power_draw_watts Current GPU power draw
# TYPE gpu_power_draw_watts gauge
gpu_power_draw_watts{gpu="0"} 30.9
# HELP gpu_memory_used_mib GPU VRAM usage in MiB
# TYPE gpu_memory_used_mib gauge
gpu_memory_used_mib{gpu="0"} 2054.0
# HELP gpu_diagnostic_status Diagnostic test status (1=pass, 0=fail, 2=warn, 3=skip)
# TYPE gpu_diagnostic_status gauge
gpu_diagnostic_status{test="deployment.driver_loaded",gpu_uuid=""} 1.0
# HELP gpu_diagnostic_run_total Total diagnostic runs
# TYPE gpu_diagnostic_run_total counter
gpu_diagnostic_run_total 3.0
```
Status values: 1 = PASS, 0 = FAIL, 2 = WARN, 3 = SKIP
The /health endpoint returns {"status": "ok"} for load balancer health checks.
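To sanity-check the exporter without a full Prometheus deployment, you can scrape and parse the endpoint directly. The sketch below uses the stdlib plus `prometheus_client`'s text parser and assumes the exporter is already running locally on the default port.

```python
# scrape_check.py -- illustrative sketch: pull and parse the exporter output locally.
# Assumes `python -m src.main metrics --port 9835` (or the compose stack) is running.
from urllib.request import urlopen

from prometheus_client.parser import text_string_to_metric_families

def read_gpu_temperatures(url: str = "http://localhost:9835/metrics") -> dict[str, float]:
    text = urlopen(url, timeout=5).read().decode("utf-8")
    temps = {}
    for family in text_string_to_metric_families(text):
        if family.name == "gpu_temperature_celsius":
            for sample in family.samples:
                temps[sample.labels.get("gpu", "?")] = sample.value
    return temps

if __name__ == "__main__":
    for gpu, temp in read_gpu_temperatures().items():
        print(f"GPU {gpu}: {temp:.1f} °C")
```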
Pre-configured Prometheus alerts in config/prometheus/alerts.yml:
| Alert | Condition | Severity |
|---|---|---|
| GPUTemperatureCritical | > 85°C for 2m | critical |
| GPUTemperatureWarning | > 75°C for 5m | warning |
| GPUDiagnosticFailed | Any test fails | critical |
| GPUPowerExcessive | > 290W for 2m | warning |
| GPUECCDoublebitError | DBE count > 0 | critical |
| GPUECCSinglebitRising | SBE rate > 0.1/hr | warning |
GPU-specific thresholds in config/profiles/:
```yaml
# config/profiles/rtx_5070ti.yaml
gpu_model: "NVIDIA GeForce RTX 5070 Ti"
gpu_count: 1
pcie_gen_expected: 4
pcie_width_expected: 16
temp_warning_c: 80
temp_critical_c: 89
power_limit_w: 300
```

Included profiles: RTX 5070 Ti, A100 80GB SXM, H100 SXM.
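As an illustration of how a profile might be applied, the sketch below loads the YAML with `pyyaml` and compares a live reading from `pynvml` against the profile's temperature thresholds. It is not the project's own loader, just a minimal example under those assumptions.

```python
# profile_check.py -- illustrative sketch: compare a live temperature against profile thresholds.
import pynvml
import yaml

def check_temperature(profile_path: str = "config/profiles/rtx_5070ti.yaml") -> None:
    with open(profile_path) as f:
        profile = yaml.safe_load(f)

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        if temp >= profile["temp_critical_c"]:
            print(f"CRITICAL: {temp} °C >= {profile['temp_critical_c']} °C")
        elif temp >= profile["temp_warning_c"]:
            print(f"WARNING: {temp} °C >= {profile['temp_warning_c']} °C")
        else:
            print(f"OK: {temp} °C (warning at {profile['temp_warning_c']} °C)")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    check_temperature()
```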
```bash
pytest tests/              # 208 tests, 0 warnings
ruff check src/ tests/     # Lint (all checks pass)
```

| Test File | Items | Coverage |
|---|---|---|
| `test_production.py` | 23 | Prometheus metrics exposition format, /metrics and /health HTTP endpoints |
| `test_telemetry.py` | 22 | XID error detection, clock throttle reason classification, ECC error counters |
| `test_pcie_validation_diag.py` | 14 | PCIe gen/width/replay detection, degradation summary |
| `test_history.py` | 14 | JSONL run persistence, newest-first ordering, `--failures` filter, malformed line handling |
| `test_fault_injection.py` | 30 | 5 fault types × parametrize — FAIL status, DIAG-FI-* prefix, injected flag, no collision with real codes |
| `test_stress.py` | 12 | Compute stress, SM saturation |
| `test_runner.py` | 12 | Test orchestration, result aggregation |
| `test_run_levels.py` | 12 | quick / medium / long / extended level configuration |
| `test_interconnect.py` | 12 | NVLink P2P, topology mapping |
| `test_gpu_health.py` | 12 | Temperature, power, VRAM availability, clock responsiveness |
| `test_deployment.py` | 12 | Driver load, GPU count/model/ECC, unique DIAG-001–004 codes, nvml session lifecycle |
| `test_bandwidth.py` | 12 | Host-to-device, device-to-host, and memory triad bandwidth |
| `test_memory_test.py` | 10 | VRAM allocation, pattern verification |
| `test_cli.py` | 7 | `monitor` KeyboardInterrupt handling, `history` rendering and flag filters |
| `test_pcie_validation.py` | 4 | PCIe link speed/width pass/fail cases |
| **Total** | **208** | |
PytestCollectionWarning suppression is configured in pyproject.toml.
GitHub Actions runs on every push/PR to master:
- Ruff linting
- Full test suite on Python 3.11 and 3.13
- JUnit XML artifact upload
| Package | Purpose |
|---|---|
| `nvidia-ml-py` | pynvml bindings for GPU hardware access |
| `psutil` | CPU, RAM, and OS system info |
| `click` | CLI framework |
| `pyyaml` | Hardware profile configuration |
| `rich` | Terminal output tables and live display |
| `torch` | Compute stress and memory bandwidth tests |
| `prometheus-client` | Prometheus exposition format exporter |
| `sqlalchemy` | Database layer (placeholder, not active) |
| `psycopg2-binary` | PostgreSQL driver (placeholder) |
- Python 3.11+
- NVIDIA GPU with driver installed
- Docker + NVIDIA Container Toolkit (for containerized deployment)
Windows (WDDM): On Windows, nvmlDeviceGetComputeRunningProcesses returns all processes with a GPU context — including the desktop compositor, browsers, and system UI — not just CUDA workloads. The gpu_processes check filters to processes consuming >100 MB VRAM to correctly distinguish compute workloads from display processes. On Linux servers this filter has no effect.
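For illustration, a filter of that kind can be written directly against `pynvml`. The sketch below applies the same >100 MB VRAM cutoff described above; it is a standalone example, not the project's `gpu_processes` implementation.

```python
# compute_process_filter.py -- illustrative sketch: list GPU processes that look like real
# compute workloads by dropping small (<100 MiB) contexts such as the desktop compositor
# that WDDM reports alongside CUDA workloads.
import pynvml

VRAM_CUTOFF_BYTES = 100 * 1024 * 1024  # ~100 MiB cutoff, as described above

def compute_processes(gpu_index: int = 0) -> list[tuple[int, int]]:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        # usedGpuMemory can be None when the driver cannot report it
        return [
            (p.pid, p.usedGpuMemory)
            for p in procs
            if p.usedGpuMemory and p.usedGpuMemory > VRAM_CUTOFF_BYTES
        ]
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    for pid, mem in compute_processes():
        print(f"PID {pid}: {mem / 2**20:.0f} MiB")
```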
ECC: Consumer GPUs (GeForce series) do not support ECC. deployment.ecc_mode and telemetry.ecc_health report SKIP on these devices — this is expected behavior, not a fault.
Clock throttle: App-clock-limiting (clocks_event_reason_applications_clocks_setting) on consumer GPUs reflects user-configured application clock caps, not a hardware fault. This state is classified as PASS.
NCCL validation (simulated): nccl_validation.py runs an in-process simulation of AllReduce and AllGather — it measures PCIe P2P bandwidth between GPUs but does not initialize torch.distributed or invoke the NCCL library. True NCCL collective op benchmarking requires multi-process execution (torchrun/mpirun) on a multi-GPU node. The current test validates computation correctness and raw P2P throughput; full NCCL integration is a planned enhancement.
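For reference, a true multi-process NCCL collective check (the planned enhancement) would look roughly like the generic `torch.distributed` sketch below, launched with `torchrun --nproc_per_node=<num_gpus>`. This is not code from this repository.

```python
# nccl_allreduce_check.py -- illustrative sketch of a real NCCL all-reduce validation.
# Launch with: torchrun --nproc_per_node=<num_gpus> nccl_allreduce_check.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world = dist.get_world_size()

    # Each rank contributes a tensor filled with its rank; the all-reduce sum
    # must equal 0 + 1 + ... + (world - 1) on every rank.
    x = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    expected = world * (world - 1) / 2
    assert torch.allclose(x, torch.full_like(x, expected)), "NCCL all-reduce mismatch"

    if rank == 0:
        print(f"all_reduce OK across {world} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```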
Database persistence: The database/ directory and sqlalchemy/psycopg2-binary dependencies are included for a planned SQLAlchemy persistence layer. File-based run history is available now via reports/.run_history.jsonl.
The metrics exporter uses prometheus_client (push model via /metrics). OpenTelemetry (OTEL) exporter support is planned — this will enable traces, metrics, and logs via a single OTEL collector rather than a dedicated Prometheus scraper. The Prometheus endpoint will remain for backwards compatibility.
| Enhancement | Status |
|---|---|
| OpenTelemetry exporter (`opentelemetry-sdk`) | Planned |
| ExLlamaV2 inference benchmark module | Planned |
| TensorRT-LLM throughput benchmark (fp8 vs bf16, sm_120) | Planned |
| NCCL true multi-process collective validation (`torchrun`) | Planned |
| SQLAlchemy result persistence (`database/` layer) | Planned |
MIT