Conversation

@ArangoGutierrez
Contributor

Summary

Implements the Device API Server - a gRPC server that provides a unified interface for GPU device information, supporting both external providers (device plugins) and a built-in NVML fallback provider for direct GPU enumeration and health monitoring.

Key components:

  • gRPC Services: GpuService (consumer API) and ProviderService (provider API)
  • In-memory cache with RWMutex for thread-safe GPU state management (see the sketch after this summary)
  • Watch broadcaster for real-time GPU state change notifications
  • NVML fallback provider for direct GPU enumeration and XID-based health monitoring
  • Prometheus metrics for observability (cache stats, NVML status, gRPC latency)
  • Helm chart for Kubernetes deployment with ServiceMonitor and alerting rules

11 commits, 54 files, ~12.8k lines
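
As a rough illustration of the cache bullet above, here is a minimal sketch of an RWMutex-guarded GPU map. The type and method names (`gpuCache`, `Gpu`, `Get`, `Set`) are assumptions for illustration, not the PR's actual API.

```go
package cache

import "sync"

// Gpu is a stand-in for the generated protobuf Gpu message.
type Gpu struct {
	UUID            string
	Healthy         bool
	ResourceVersion uint64
}

// gpuCache guards a UUID-keyed map with an RWMutex so that concurrent
// readers never block each other and only writes take the exclusive lock.
type gpuCache struct {
	mu   sync.RWMutex
	gpus map[string]*Gpu
}

func newGpuCache() *gpuCache {
	return &gpuCache{gpus: make(map[string]*Gpu)}
}

// Get takes the read lock; many consumers can read concurrently.
func (c *gpuCache) Get(uuid string) (*Gpu, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	g, ok := c.gpus[uuid]
	return g, ok
}

// Set takes the write lock, blocking readers only for the duration of the update.
func (c *gpuCache) Set(g *Gpu) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.gpus[g.UUID] = g
}
```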

Type of Change

  • ✨ New feature (new message, field, or service method)
  • 📚 Documentation
  • 🔧 Build/Tooling

Checklist

  • Proto files compile successfully (make protos-generate)
  • Generated code is up to date and committed
  • Self-review completed
  • Documentation updated (if needed)
  • Signed-off commits (DCO)

Test Results

  • 35 unit tests passing with -race flag
  • go vet clean
  • golangci-lint clean (0 issues)

Copilot AI left a comment

Pull request overview

This pull request implements a comprehensive Device API Server with NVML fallback provider support for GPU device information and health monitoring. The server acts as a node-local gRPC cache intermediary between providers (health monitors) and consumers (device plugins, DRA drivers).

Changes:

  • Implements complete Device API Server with gRPC services (GpuService for consumers, ProviderService for providers)
  • Adds built-in NVML fallback provider for direct GPU enumeration and XID-based health monitoring
  • Implements a thread-safe in-memory cache guarded by an RWMutex, so reads block only while a write is in progress
  • Adds a watch broadcaster for real-time GPU state change notifications (sketched after this list)
  • Includes comprehensive Prometheus metrics for observability
  • Provides Helm charts for Kubernetes deployment with ServiceMonitor and alerting
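
A minimal sketch of the channel-based fan-out a watch broadcaster typically uses. The names (`broadcaster`, `Event`, `Subscribe`, `Broadcast`) and the drop-on-full-buffer policy are assumptions, not necessarily what this PR implements.

```go
package cache

import "sync"

// Event describes a GPU state change delivered to watchers.
type Event struct {
	UUID    string
	Healthy bool
}

type broadcaster struct {
	mu   sync.Mutex
	subs map[chan Event]struct{}
}

func newBroadcaster() *broadcaster {
	return &broadcaster{subs: make(map[chan Event]struct{})}
}

// Subscribe returns a buffered channel that receives future events and a
// cancel func that removes the subscription and closes the channel.
func (b *broadcaster) Subscribe() (<-chan Event, func()) {
	ch := make(chan Event, 16)
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	cancel := func() {
		b.mu.Lock()
		delete(b.subs, ch)
		close(ch)
		b.mu.Unlock()
	}
	return ch, cancel
}

// Broadcast delivers the event to every subscriber, dropping it for
// subscribers whose buffers are full so a slow watcher cannot block the cache.
func (b *broadcaster) Broadcast(e Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case ch <- e:
		default:
		}
	}
}
```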

Reviewed changes

Copilot reviewed 51 out of 54 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
| --- | --- |
| pkg/version/version.go | Version information package with build-time ldflags support |
| pkg/deviceapiserver/config.go | Server configuration with flags, environment variables, and validation |
| pkg/deviceapiserver/server.go | Main server implementation with lifecycle management and graceful shutdown |
| pkg/deviceapiserver/cache/*.go | Thread-safe GPU cache with broadcaster for watch events |
| pkg/deviceapiserver/service/*.go | gRPC service implementations (consumer and provider APIs) |
| pkg/deviceapiserver/nvml/*.go | NVML provider for GPU enumeration and XID health monitoring |
| pkg/deviceapiserver/metrics/metrics.go | Prometheus metrics definitions and recording |
| cmd/device-api-server/main.go | Main entry point with flag parsing and signal handling |
| docs/*.md | Comprehensive API, operations, and design documentation |
| charts/device-api-server/* | Helm chart templates for Kubernetes deployment |
| go.mod / go.sum | Dependencies including NVML, gRPC, and Prometheus libraries |
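
For context on the metrics.go row, here is a hedged sketch of how such metrics are commonly declared with client_golang; the metric names and helper functions below are made up for illustration and will not match the PR's actual definitions.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metric names; the real ones live in pkg/deviceapiserver/metrics/metrics.go.
var (
	cachedGpus = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "device_api_server_cached_gpus",
		Help: "Number of GPUs currently held in the in-memory cache.",
	})

	grpcLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "device_api_server_grpc_request_duration_seconds",
		Help:    "Latency of gRPC handler calls.",
		Buckets: prometheus.DefBuckets,
	}, []string{"method"})
)

// RecordRequest is an illustrative helper a gRPC interceptor could call per request.
func RecordRequest(method string, seconds float64) {
	grpcLatency.WithLabelValues(method).Observe(seconds)
}

// SetCachedGpus updates the gauge after cache mutations.
func SetCachedGpus(n int) {
	cachedGpus.Set(float64(n))
}
```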

Comment on lines +419 to +421
- gRPC uses plaintext (designed for node-local communication)
- Unix socket preferred over TCP for local clients
- Consider NetworkPolicy to restrict access

Copilot AI Jan 21, 2026

The "Network Security" section documents that the gRPC server uses plaintext and is intended for node-local communication, but the implementation also exposes the same unauthenticated ProviderService control-plane API over a TCP listener (default --grpc-address=:50051) with no built‑in TLS or authentication. In a multi-tenant or partially untrusted cluster, any pod that can reach the server pod IP and port can invoke ProviderService methods to register or modify GPUs, effectively allowing unauthorized manipulation of GPU health state and scheduling inputs. To reduce this risk, the design should either require TLS/mTLS or equivalent authentication for gRPC, and/or strongly recommend disabling the TCP listener by default so that provider RPCs are only reachable via a restricted Unix socket combined with NetworkPolicy.

Suggested change
- gRPC uses plaintext (designed for node-local communication)
- Unix socket preferred over TCP for local clients
- Consider NetworkPolicy to restrict access
- gRPC over TCP is **plaintext and unauthenticated** by default and is intended **only for node-local access**.
- Prefer using a Unix domain socket for all local clients and providers; avoid exposing the TCP gRPC listener (`--grpc-address`, default `:50051`) to the broader cluster network.
- In multi-tenant or partially untrusted clusters, strongly recommend **disabling the TCP listener** or binding it only to a node-local interface and using a restricted Unix socket combined with Kubernetes `NetworkPolicy` to limit access to the Device API Server pod.
- If TCP gRPC must be exposed beyond the node, terminate it behind TLS/mTLS or an equivalent authenticated tunnel (for example, a sidecar proxy or ingress with client auth) and enforce a least-privilege `NetworkPolicy`.
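
A sketch of the Unix-socket-only deployment the suggestion points at: serve gRPC on a node-local socket and simply never open the TCP listener. The socket path and the commented-out registration call are placeholders, not the PR's actual flags or generated code.

```go
package main

import (
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
)

func main() {
	// Hypothetical socket path; the server's actual path is set by its flags/config.
	const socketPath = "/var/run/device-api-server/device-api.sock"
	_ = os.Remove(socketPath) // clean up a stale socket from a previous run

	lis, err := net.Listen("unix", socketPath)
	if err != nil {
		log.Fatalf("listen on unix socket: %v", err)
	}

	srv := grpc.NewServer()
	// pb.RegisterGpuServiceServer(srv, &gpuService{})       // register consumer API here
	// pb.RegisterProviderServiceServer(srv, &providerService{}) // and the provider API

	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```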

Add ProviderService with RPCs for GPU lifecycle management:
- RegisterGpu / UnregisterGpu
- UpdateGpuStatus / UpdateGpuCondition

Also adds resource_version field to Gpu message for optimistic
concurrency control.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
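
To illustrate the resource_version field from the commit above, here is a hedged sketch of compare-and-swap style updates, reusing the hypothetical gpuCache from the earlier sketch (it additionally needs fmt imported); the method name and error wording are assumptions, not the PR's implementation.

```go
// UpdateStatus applies a status change only if the caller's view of the GPU
// is still current, then bumps the version so later stale writes are rejected.
func (c *gpuCache) UpdateStatus(uuid string, expectedVersion uint64, healthy bool) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	g, ok := c.gpus[uuid]
	if !ok {
		return fmt.Errorf("gpu %s is not registered", uuid)
	}
	// Optimistic concurrency: reject writes based on an outdated read.
	if g.ResourceVersion != expectedVersion {
		return fmt.Errorf("conflict: resource version is %d, caller expected %d",
			g.ResourceVersion, expectedVersion)
	}
	g.Healthy = healthy
	g.ResourceVersion++
	return nil
}
```
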
Implements the Device API Server - a gRPC server providing unified
GPU device information via GpuService (consumer) and ProviderService
(provider) APIs.

Components:
- In-memory cache with RWMutex for thread-safe GPU state
- Watch broadcaster for real-time state change notifications
- NVML fallback provider for GPU enumeration and XID health monitoring
- Prometheus metrics (cache, watch, NVML, gRPC stats)
- Helm chart with ServiceMonitor and alerting rules

Includes comprehensive unit tests (35 tests, race-clean) and
documentation (API reference, operations guide, design docs).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
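
For the NVML fallback provider mentioned above, a minimal enumeration sketch using the public go-nvml bindings (github.com/NVIDIA/go-nvml); this shows direct GPU enumeration only, not the PR's XID event handling, and error handling is deliberately simplified.

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
	}

	// Enumerate every GPU visible to the driver and print its identity.
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			continue
		}
		uuid, _ := dev.GetUUID()
		name, _ := dev.GetName()
		fmt.Printf("gpu %d: %s (%s)\n", i, uuid, name)
	}
}
```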