feat: instance usage monitoring with metrics-server integration #98

Open
hippoley wants to merge 1 commit into Yuan-lab-LLM:main from hippoley:feat/instance-usage-monitoring
@hippoley
Contributor

Summary

Implements real-time instance resource usage monitoring by polling the Kubernetes metrics-server API. The instance_usage table already exists in the DB schema (created by the init migration) but had zero implementation — this PR fills that gap entirely.

What's New

Repository Layer

  • InstanceUsageRepository with 5 methods: Create, GetLatestByInstanceID, ListByInstanceID, ListLatestPerInstance, DeleteOlderThan
  • ensureTable() as an idempotent safety net (table already exists in migration)
  • ListLatestPerInstance() uses a correlated subquery with MAX(recorded_at) for efficient admin overview

Service Layer

  • InstanceUsageService with input validation, 24h default / 30-day cap on history queries
  • InstanceUsageCollector — background goroutine that periodically collects metrics for all running instances
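The 24h default and 30-day cap can be illustrated with a small helper. This is a minimal sketch under assumptions: the function name is hypothetical, and the choice to fall back to the default for non-positive values and to clamp (rather than reject) oversized values is a guess about the service's behavior.

```go
package main

import "fmt"

const (
	defaultHistoryHours = 24      // 24h default from the PR description
	maxHistoryHours     = 30 * 24 // 30-day cap, expressed in hours
)

// normalizeHistoryHours maps a requested window to the effective one.
// Assumption: non-positive input falls back to the default and values
// above the cap are clamped; the real service may return an error instead.
func normalizeHistoryHours(hours int) int {
	if hours <= 0 {
		return defaultHistoryHours
	}
	if hours > maxHistoryHours {
		return maxHistoryHours
	}
	return hours
}

func main() {
	for _, h := range []int{0, 6, 24, 10000} {
		fmt.Printf("requested=%d effective=%d\n", h, normalizeHistoryHours(h))
	}
}
```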

Collector Design

  • Polls K8s metrics-server via clientset.RESTClient().AbsPath("/apis/metrics.k8s.io/v1beta1/..."), so no new go.mod dependencies are needed
  • Graceful degradation: when metrics-server is unavailable, records uptime only (no crash, no error spam)
  • Hardened with sync.Once Stop(), panic recovery, sync.Mutex.TryLock() overlap guard
  • Configurable interval via USAGE_COLLECT_INTERVAL env var (default: 60s)
  • CPU reported as percentage of instance's allocated cores (millicores → percent)
  • Memory reported in GB
  • Uptime calculated from pod startTime
  • GPU and disk left as nil (requires DCGM / kubelet stats API — planned for Part 2)

API Endpoints

| Method | Path | Scope | Description |
|--------|------|-------|-------------|
| GET | /instances/:id/usage/current | User | Latest usage snapshot for an instance |
| GET | /instances/:id/usage/history?hours=24 | User | Historical usage records (default 24h, max 30d) |
| GET | /admin/instances/usage/summary | Admin | Latest usage for all instances |

Wiring

  • Repository, service, collector, and handler initialized in main.go
  • Collector starts after sync service, stops during graceful shutdown
  • Routes registered under existing instance and admin groups

Testing

  • 11 new tests covering CPU/memory quantity parsing (millicores, nanocores, Mi/Gi/Ki/Ti, SI units), service logic, and edge cases
  • All existing tests pass with zero regressions
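To make the quantity formats covered by the tests concrete, here is a stdlib-only sketch of parsing metrics-server CPU and memory strings. It is not the PR's parser: the function names are hypothetical, and it handles only the suffixes named above (plus `u` for microcores), not the full Kubernetes quantity grammar.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMillicores converts "250m", "1500000n", "100u", or "2" to millicores.
func parseCPUMillicores(q string) (float64, error) {
	q = strings.TrimSpace(q)
	num, mul, div := q, 1000.0, 1.0 // a bare number means whole cores
	switch {
	case strings.HasSuffix(q, "n"): // nanocores
		num, mul, div = strings.TrimSuffix(q, "n"), 1, 1e6
	case strings.HasSuffix(q, "u"): // microcores
		num, mul, div = strings.TrimSuffix(q, "u"), 1, 1e3
	case strings.HasSuffix(q, "m"): // millicores
		num, mul, div = strings.TrimSuffix(q, "m"), 1, 1
	}
	v, err := strconv.ParseFloat(num, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid CPU quantity %q: %w", q, err)
	}
	return v * mul / div, nil
}

// memoryUnits lists binary suffixes before SI ones; order is fixed on purpose.
var memoryUnits = []struct {
	suffix string
	factor float64
}{
	{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}, {"Ti", 1 << 40},
	{"k", 1e3}, {"M", 1e6}, {"G", 1e9}, {"T", 1e12},
}

// parseMemoryBytes converts "128Mi", "1Gi", "500M", or a bare byte count to bytes.
func parseMemoryBytes(q string) (float64, error) {
	q = strings.TrimSpace(q)
	num, factor := q, 1.0
	for _, u := range memoryUnits {
		if strings.HasSuffix(q, u.suffix) {
			num, factor = strings.TrimSuffix(q, u.suffix), u.factor
			break
		}
	}
	v, err := strconv.ParseFloat(num, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid memory quantity %q: %w", q, err)
	}
	return v * factor, nil
}

func main() {
	for _, q := range []string{"250m", "1500000n", "2"} {
		v, err := parseCPUMillicores(q)
		fmt.Println(q, v, err)
	}
	for _, q := range []string{"512Ki", "128Mi", "1Gi", "1G"} {
		v, err := parseMemoryBytes(q)
		fmt.Println(q, v, err)
	}
}
```

If taking on the dependency is acceptable, `resource.ParseQuantity` from `k8s.io/apimachinery` covers the complete grammar and is the more robust option.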

Design Decisions

  1. No new dependencies: Raw REST calls to metrics-server instead of adding k8s.io/metrics
  2. Non-breaking: Pure additive — no existing code modified except main.go wiring
  3. Graceful degradation: Works without metrics-server (uptime-only mode)
  4. Same hardening patterns as backup_scheduler: panic recovery, overlap guard, safe Stop()

Files Changed

| File | Change |
|------|--------|
| repository/instance_usage_repository.go | New |
| services/instance_usage_service.go | New |
| services/instance_usage_collector.go | New |
| handlers/instance_usage_handler.go | New |
| services/instance_usage_service_test.go | New |
| cmd/server/main.go | Modified (+11 lines) |

Add real-time resource usage collection for running instances by polling
the Kubernetes metrics-server API. The collector runs as a background
goroutine and gracefully degrades to uptime-only recording when the
metrics-server is unavailable.

New files:
- repository/instance_usage_repository.go: CRUD + aggregation queries
  for the existing instance_usage table (ensureTable as safety net)
- services/instance_usage_service.go: business logic with input
  validation, 24h default / 30-day cap on history queries
- services/instance_usage_collector.go: scheduled collector using
  existing rest.Config (no new dependencies), with sync.Once Stop(),
  panic recovery, TryLock overlap guard, and configurable interval
  via USAGE_COLLECT_INTERVAL env var
- handlers/instance_usage_handler.go: REST endpoints for current
  usage, history, and admin summary
- services/instance_usage_service_test.go: 11 tests covering CPU/memory
  quantity parsing, service logic, and edge cases

Modified:
- cmd/server/main.go: wire repository, service, collector, handler,
  routes, and graceful shutdown

API endpoints:
- GET /instances/:id/usage/current  (user-scoped)
- GET /instances/:id/usage/history?hours=24  (user-scoped)
- GET /admin/instances/usage/summary  (admin-only)

Design decisions:
- No new go.mod dependencies: uses clientset.RESTClient().AbsPath()
  to call metrics-server API directly
- GPU and disk metrics left as nil (requires DCGM / kubelet stats API)
- CPU reported as percentage of instance allocated cores
- Memory reported in GB
- Uptime calculated from pod startTime