Conversation

@ArangoGutierrez
Contributor

Summary

Implements the Device API Server - a gRPC server that provides a unified interface for GPU device information, supporting both external providers (device plugins) and a built-in NVML fallback provider for direct GPU enumeration and health monitoring.

Key components:

  • gRPC Services: GpuService (consumer API) and ProviderService (provider API)
  • In-memory cache with RWMutex for thread-safe GPU state management (see the sketch after this summary)
  • Watch broadcaster for real-time GPU state change notifications
  • NVML fallback provider for direct GPU enumeration and XID-based health monitoring
  • Prometheus metrics for observability (cache stats, NVML status, gRPC latency)
  • Helm chart for Kubernetes deployment with ServiceMonitor and alerting rules

11 commits, 54 files, ~12.8k lines
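
As a rough illustration of the cache bullet above, here is a minimal sketch of an RWMutex-guarded GPU map. The type and method names (`gpuCache`, `Gpu`, `Get`, `Set`) are assumptions for illustration, not the PR's actual API.

```go
package cache

import "sync"

// Gpu is a stand-in for the generated protobuf Gpu message.
type Gpu struct {
	UUID            string
	Healthy         bool
	ResourceVersion uint64
}

// gpuCache guards a UUID-keyed map with an RWMutex so that concurrent
// readers never block each other and only writes take the exclusive lock.
type gpuCache struct {
	mu   sync.RWMutex
	gpus map[string]*Gpu
}

func newGpuCache() *gpuCache {
	return &gpuCache{gpus: make(map[string]*Gpu)}
}

// Get takes the read lock; many consumers can read concurrently.
func (c *gpuCache) Get(uuid string) (*Gpu, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	g, ok := c.gpus[uuid]
	return g, ok
}

// Set takes the write lock, blocking readers only for the duration of the update.
func (c *gpuCache) Set(g *Gpu) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.gpus[g.UUID] = g
}
```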

Type of Change

  • ✨ New feature (new message, field, or service method)
  • 📚 Documentation
  • 🔧 Build/Tooling

Checklist

  • Proto files compile successfully (make protos-generate)
  • Generated code is up to date and committed
  • Self-review completed
  • Documentation updated (if needed)
  • Signed-off commits (DCO)

Test Results

  • 35 unit tests passing with -race flag
  • go vet clean
  • golangci-lint clean (0 issues)

Copilot AI left a comment

Pull request overview

This pull request implements a comprehensive Device API Server with NVML fallback provider support for GPU device information and health monitoring. The server acts as a node-local gRPC cache intermediary between providers (health monitors) and consumers (device plugins, DRA drivers).

Changes:

  • Implements complete Device API Server with gRPC services (GpuService for consumers, ProviderService for providers)
  • Adds built-in NVML fallback provider for direct GPU enumeration and XID-based health monitoring
  • Implements a thread-safe in-memory cache guarded by an RWMutex, so reads block only while a write is in progress
  • Adds a watch broadcaster for real-time GPU state change notifications (sketched after this list)
  • Includes comprehensive Prometheus metrics for observability
  • Provides Helm charts for Kubernetes deployment with ServiceMonitor and alerting
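
A minimal sketch of the channel-based fan-out a watch broadcaster typically uses. The names (`broadcaster`, `Event`, `Subscribe`, `Broadcast`) and the drop-on-full-buffer policy are assumptions, not necessarily what this PR implements.

```go
package cache

import "sync"

// Event describes a GPU state change delivered to watchers.
type Event struct {
	UUID    string
	Healthy bool
}

type broadcaster struct {
	mu   sync.Mutex
	subs map[chan Event]struct{}
}

func newBroadcaster() *broadcaster {
	return &broadcaster{subs: make(map[chan Event]struct{})}
}

// Subscribe returns a buffered channel that receives future events and a
// cancel func that removes the subscription and closes the channel.
func (b *broadcaster) Subscribe() (<-chan Event, func()) {
	ch := make(chan Event, 16)
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	cancel := func() {
		b.mu.Lock()
		delete(b.subs, ch)
		close(ch)
		b.mu.Unlock()
	}
	return ch, cancel
}

// Broadcast delivers the event to every subscriber, dropping it for
// subscribers whose buffers are full so a slow watcher cannot block the cache.
func (b *broadcaster) Broadcast(e Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case ch <- e:
		default:
		}
	}
}
```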

Reviewed changes

Copilot reviewed 51 out of 54 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
| --- | --- |
| pkg/version/version.go | Version information package with build-time ldflags support |
| pkg/deviceapiserver/config.go | Server configuration with flags, environment variables, and validation |
| pkg/deviceapiserver/server.go | Main server implementation with lifecycle management and graceful shutdown |
| pkg/deviceapiserver/cache/*.go | Thread-safe GPU cache with broadcaster for watch events |
| pkg/deviceapiserver/service/*.go | gRPC service implementations (consumer and provider APIs) |
| pkg/deviceapiserver/nvml/*.go | NVML provider for GPU enumeration and XID health monitoring |
| pkg/deviceapiserver/metrics/metrics.go | Prometheus metrics definitions and recording |
| cmd/device-api-server/main.go | Main entry point with flag parsing and signal handling |
| docs/*.md | Comprehensive API, operations, and design documentation |
| charts/device-api-server/* | Helm chart templates for Kubernetes deployment |
| go.mod / go.sum | Dependencies including NVML, gRPC, and Prometheus libraries |
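
For context on the metrics.go row, here is a hedged sketch of how such metrics are commonly declared with client_golang; the metric names and helper functions below are made up for illustration and will not match the PR's actual definitions.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metric names; the real ones live in pkg/deviceapiserver/metrics/metrics.go.
var (
	cachedGpus = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "device_api_server_cached_gpus",
		Help: "Number of GPUs currently held in the in-memory cache.",
	})

	grpcLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "device_api_server_grpc_request_duration_seconds",
		Help:    "Latency of gRPC handler calls.",
		Buckets: prometheus.DefBuckets,
	}, []string{"method"})
)

// RecordRequest is an illustrative helper a gRPC interceptor could call per request.
func RecordRequest(method string, seconds float64) {
	grpcLatency.WithLabelValues(method).Observe(seconds)
}

// SetCachedGpus updates the gauge after cache mutations.
func SetCachedGpus(n int) {
	cachedGpus.Set(float64(n))
}
```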

Comment on lines +419 to +421
- gRPC uses plaintext (designed for node-local communication)
- Unix socket preferred over TCP for local clients
- Consider NetworkPolicy to restrict access

Copilot AI Jan 21, 2026

The "Network Security" section documents that the gRPC server uses plaintext and is intended for node-local communication, but the implementation also exposes the same unauthenticated ProviderService control-plane API over a TCP listener (default --grpc-address=:50051) with no built‑in TLS or authentication. In a multi-tenant or partially untrusted cluster, any pod that can reach the server pod IP and port can invoke ProviderService methods to register or modify GPUs, effectively allowing unauthorized manipulation of GPU health state and scheduling inputs. To reduce this risk, the design should either require TLS/mTLS or equivalent authentication for gRPC, and/or strongly recommend disabling the TCP listener by default so that provider RPCs are only reachable via a restricted Unix socket combined with NetworkPolicy.

Suggested change
- gRPC uses plaintext (designed for node-local communication)
- Unix socket preferred over TCP for local clients
- Consider NetworkPolicy to restrict access
- gRPC over TCP is **plaintext and unauthenticated** by default and is intended **only for node-local access**.
- Prefer using a Unix domain socket for all local clients and providers; avoid exposing the TCP gRPC listener (`--grpc-address`, default `:50051`) to the broader cluster network.
- In multi-tenant or partially untrusted clusters, strongly recommend **disabling the TCP listener** or binding it only to a node-local interface and using a restricted Unix socket combined with Kubernetes `NetworkPolicy` to limit access to the Device API Server pod.
- If TCP gRPC must be exposed beyond the node, terminate it behind TLS/mTLS or an equivalent authenticated tunnel (for example, a sidecar proxy or ingress with client auth) and enforce a least-privilege `NetworkPolicy`.
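
A sketch of the Unix-socket-only deployment the suggestion points at: serve gRPC on a node-local socket and simply never open the TCP listener. The socket path and the commented-out registration call are placeholders, not the PR's actual flags or generated code.

```go
package main

import (
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
)

func main() {
	// Hypothetical socket path; the server's actual path is set by its flags/config.
	const socketPath = "/var/run/device-api-server/device-api.sock"
	_ = os.Remove(socketPath) // clean up a stale socket from a previous run

	lis, err := net.Listen("unix", socketPath)
	if err != nil {
		log.Fatalf("listen on unix socket: %v", err)
	}

	srv := grpc.NewServer()
	// pb.RegisterGpuServiceServer(srv, &gpuService{})       // register consumer API here
	// pb.RegisterProviderServiceServer(srv, &providerService{}) // and the provider API

	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```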

Add ProviderService with RPCs for GPU lifecycle management:
- RegisterGpu / UnregisterGpu
- UpdateGpuStatus / UpdateGpuCondition

Also adds resource_version field to Gpu message for optimistic
concurrency control.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
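
To illustrate the resource_version field from the commit above, here is a hedged sketch of compare-and-swap style updates, reusing the hypothetical gpuCache from the earlier sketch (it additionally needs fmt imported); the method name and error wording are assumptions, not the PR's implementation.

```go
// UpdateStatus applies a status change only if the caller's view of the GPU
// is still current, then bumps the version so later stale writes are rejected.
func (c *gpuCache) UpdateStatus(uuid string, expectedVersion uint64, healthy bool) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	g, ok := c.gpus[uuid]
	if !ok {
		return fmt.Errorf("gpu %s is not registered", uuid)
	}
	// Optimistic concurrency: reject writes based on an outdated read.
	if g.ResourceVersion != expectedVersion {
		return fmt.Errorf("conflict: resource version is %d, caller expected %d",
			g.ResourceVersion, expectedVersion)
	}
	g.Healthy = healthy
	g.ResourceVersion++
	return nil
}
```
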
Implements the Device API Server - a gRPC server providing unified
GPU device information via GpuService (consumer) and ProviderService
(provider) APIs.

Components:
- In-memory cache with RWMutex for thread-safe GPU state
- Watch broadcaster for real-time state change notifications
- NVML fallback provider for GPU enumeration and XID health monitoring
- Prometheus metrics (cache, watch, NVML, gRPC stats)
- Helm chart with ServiceMonitor and alerting rules

Includes comprehensive unit tests (35 tests, race-clean) and
documentation (API reference, operations guide, design docs).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
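
For the NVML fallback provider mentioned above, a minimal enumeration sketch using the public go-nvml bindings (github.com/NVIDIA/go-nvml); this shows direct GPU enumeration only, not the PR's XID event handling, and error handling is deliberately simplified.

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
	}

	// Enumerate every GPU visible to the driver and print its identity.
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			continue
		}
		uuid, _ := dev.GetUUID()
		name, _ := dev.GetName()
		fmt.Printf("gpu %d: %s (%s)\n", i, uuid, name)
	}
}
```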