feat: add device-api-server with NVML fallback provider #720
base: device-api
Pull request overview
This pull request implements a comprehensive Device API Server with NVML fallback provider support for GPU device information and health monitoring. The server acts as a node-local gRPC cache intermediary between providers (health monitors) and consumers (device plugins, DRA drivers).
Changes:
- Implements complete Device API Server with gRPC services (GpuService for consumers, ProviderService for providers)
- Adds built-in NVML fallback provider for direct GPU enumeration and XID-based health monitoring
- Implements thread-safe in-memory cache with RWMutex for read-blocking semantics during writes
- Adds watch broadcaster for real-time GPU state change notifications
- Includes comprehensive Prometheus metrics for observability
- Provides Helm charts for Kubernetes deployment with ServiceMonitor and alerting
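As a rough illustration of the RWMutex-guarded cache and watch broadcaster listed above, a minimal Go sketch might look like the following. The `GPU` struct, the `Cache` type, and its method names are hypothetical and not taken from the PR; the actual code under `pkg/deviceapiserver/cache` may differ substantially. In particular, dropping events for slow watchers is an assumption made here to keep writers from blocking.

```go
package main

import "sync"

// GPU is a hypothetical stand-in for the PR's Gpu message.
type GPU struct {
	UUID    string
	Healthy bool
}

// Cache guards GPU state with an RWMutex: many readers may hold the
// read lock concurrently, while a writer excludes readers and writers.
type Cache struct {
	mu       sync.RWMutex
	gpus     map[string]GPU
	watchers map[chan GPU]struct{}
}

func NewCache() *Cache {
	return &Cache{
		gpus:     make(map[string]GPU),
		watchers: make(map[chan GPU]struct{}),
	}
}

// Get takes the read lock, so it blocks only while a write is in progress.
func (c *Cache) Get(uuid string) (GPU, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	g, ok := c.gpus[uuid]
	return g, ok
}

// Set stores the GPU and broadcasts the new state to every watcher.
// Assumption: a watcher whose buffer is full misses the event rather
// than stalling the writer.
func (c *Cache) Set(g GPU) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.gpus[g.UUID] = g
	for ch := range c.watchers {
		select {
		case ch <- g:
		default: // slow watcher: drop instead of blocking
		}
	}
}

// Watch registers a buffered channel and returns it with a cancel func.
// cancel is safe to call more than once.
func (c *Cache) Watch(buf int) (<-chan GPU, func()) {
	ch := make(chan GPU, buf)
	c.mu.Lock()
	c.watchers[ch] = struct{}{}
	c.mu.Unlock()
	cancel := func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		if _, ok := c.watchers[ch]; ok {
			delete(c.watchers, ch)
			close(ch)
		}
	}
	return ch, cancel
}
```

A consumer would call `Watch`, range over the returned channel, and call the cancel function when its stream ends. Sending under the write lock keeps the close in `cancel` from racing with a send in `Set`.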
Reviewed changes
Copilot reviewed 51 out of 54 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/version/version.go | Version information package with build-time ldflags support |
| pkg/deviceapiserver/config.go | Server configuration with flags, environment variables, and validation |
| pkg/deviceapiserver/server.go | Main server implementation with lifecycle management and graceful shutdown |
| pkg/deviceapiserver/cache/*.go | Thread-safe GPU cache with broadcaster for watch events |
| pkg/deviceapiserver/service/*.go | gRPC service implementations (consumer and provider APIs) |
| pkg/deviceapiserver/nvml/*.go | NVML provider for GPU enumeration and XID health monitoring |
| pkg/deviceapiserver/metrics/metrics.go | Prometheus metrics definitions and recording |
| cmd/device-api-server/main.go | Main entry point with flag parsing and signal handling |
| docs/*.md | Comprehensive API, operations, and design documentation |
| charts/device-api-server/* | Helm chart templates for Kubernetes deployment |
| go.mod/go.sum | Dependencies including NVML, gRPC, and Prometheus libraries |

- gRPC uses plaintext (designed for node-local communication)
- Unix socket preferred over TCP for local clients
- Consider NetworkPolicy to restrict access
Copilot AI, Jan 21, 2026
The "Network Security" section documents that the gRPC server uses plaintext and is intended for node-local communication. However, the implementation also exposes the same unauthenticated ProviderService control-plane API over a TCP listener (default --grpc-address=:50051) with no built-in TLS or authentication. In a multi-tenant or partially untrusted cluster, any pod that can reach the server pod IP and port can invoke ProviderService methods to register or modify GPUs, which allows unauthorized manipulation of GPU health state and scheduling inputs. To reduce this risk, the design should require TLS/mTLS or equivalent authentication for gRPC, and/or strongly recommend disabling the TCP listener by default so that provider RPCs are reachable only via a restricted Unix socket combined with NetworkPolicy.
- - gRPC uses plaintext (designed for node-local communication)
- - Unix socket preferred over TCP for local clients
- - Consider NetworkPolicy to restrict access
+ - gRPC over TCP is **plaintext and unauthenticated** by default and is intended **only for node-local access**.
+ - Prefer using a Unix domain socket for all local clients and providers; avoid exposing the TCP gRPC listener (`--grpc-address`, default `:50051`) to the broader cluster network.
+ - In multi-tenant or partially untrusted clusters, strongly recommend **disabling the TCP listener** or binding it only to a node-local interface and using a restricted Unix socket combined with Kubernetes `NetworkPolicy` to limit access to the Device API Server pod.
+ - If TCP gRPC must be exposed beyond the node, terminate it behind TLS/mTLS or an equivalent authenticated tunnel (for example, a sidecar proxy or ingress with client auth) and enforce a least-privilege `NetworkPolicy`.
Add ProviderService with RPCs for GPU lifecycle management:
- RegisterGpu / UnregisterGpu
- UpdateGpuStatus / UpdateGpuCondition

Also adds resource_version field to Gpu message for optimistic concurrency control.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Implements the Device API Server - a gRPC server providing unified GPU device information via GpuService (consumer) and ProviderService (provider) APIs.

Components:
- In-memory cache with RWMutex for thread-safe GPU state
- Watch broadcaster for real-time state change notifications
- NVML fallback provider for GPU enumeration and XID health monitoring
- Prometheus metrics (cache, watch, NVML, gRPC stats)
- Helm chart with ServiceMonitor and alerting rules

Includes comprehensive unit tests (35 tests, race-clean) and documentation (API reference, operations guide, design docs).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
224597b to b907fb1
Summary
Implements the Device API Server - a gRPC server that provides a unified interface for GPU device information, supporting both external providers (device plugins) and a built-in NVML fallback provider for direct GPU enumeration and health monitoring.
Key components:
- `GpuService` (consumer API) and `ProviderService` (provider API)

11 commits, 54 files, ~12.8k lines
Type of Change
Checklist
- … `make protos-generate`)

Test Results
- … `-race` flag
- `go vet` clean
- `golangci-lint` clean (0 issues)