feat: instance usage monitoring with metrics-server integration #98
Open
hippoley wants to merge 1 commit into Yuan-lab-LLM:main from
Add real-time resource usage collection for running instances by polling the Kubernetes metrics-server API. The collector runs as a background goroutine and gracefully degrades to uptime-only recording when the metrics-server is unavailable.

New files:
- repository/instance_usage_repository.go: CRUD + aggregation queries for the existing instance_usage table (ensureTable as safety net)
- services/instance_usage_service.go: business logic with input validation, 24h default / 30-day cap on history queries
- services/instance_usage_collector.go: scheduled collector using existing rest.Config (no new dependencies), with sync.Once Stop(), panic recovery, TryLock overlap guard, and configurable interval via USAGE_COLLECT_INTERVAL env var
- handlers/instance_usage_handler.go: REST endpoints for current usage, history, and admin summary
- services/instance_usage_service_test.go: 11 tests covering CPU/memory quantity parsing, service logic, and edge cases

Modified:
- cmd/server/main.go: wire repository, service, collector, handler, routes, and graceful shutdown

API endpoints:
- GET /instances/:id/usage/current (user-scoped)
- GET /instances/:id/usage/history?hours=24 (user-scoped)
- GET /admin/instances/usage/summary (admin-only)

Design decisions:
- No new go.mod dependencies: uses clientset.RESTClient().AbsPath() to call metrics-server API directly
- GPU and disk metrics left as nil (requires DCGM / kubelet stats API)
- CPU reported as percentage of instance allocated cores
- Memory reported in GB
- Uptime calculated from pod startTime
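The collector lifecycle named in this commit message (sync.Once Stop(), panic recovery, TryLock overlap guard) might look roughly like the minimal Go sketch below. The Collector type, its fields, NewCollector, and the injected collect function are hypothetical names chosen for illustration; they are not taken from the PR, and the real collector presumably queries the metrics-server where this sketch calls collect.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Collector is a hypothetical stand-in for the PR's usage collector.
type Collector struct {
	interval time.Duration
	collect  func()        // one collection pass; injected so the sketch is testable
	mu       sync.Mutex    // held while a pass is running
	stopOnce sync.Once     // makes Stop safe to call more than once
	stopCh   chan struct{} // closed by Stop to end the loop
	done     chan struct{} // closed when the loop goroutine exits
}

func NewCollector(interval time.Duration, collect func()) *Collector {
	return &Collector{
		interval: interval,
		collect:  collect,
		stopCh:   make(chan struct{}),
		done:     make(chan struct{}),
	}
}

// Start launches the background loop. Stop assumes Start has been called.
func (c *Collector) Start() {
	go func() {
		defer close(c.done)
		ticker := time.NewTicker(c.interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				c.RunOnce()
			case <-c.stopCh:
				return
			}
		}
	}()
}

// RunOnce performs a single pass unless one is already in flight.
// It reports whether the pass ran to completion.
func (c *Collector) RunOnce() (ran bool) {
	if !c.mu.TryLock() { // overlap guard: skip this tick instead of queueing
		return false
	}
	defer c.mu.Unlock()
	defer func() {
		if r := recover(); r != nil { // a panicking pass must not kill the loop
			fmt.Println("usage collection panicked:", r)
		}
	}()
	c.collect()
	return true
}

// Stop ends the loop and waits for it to exit; repeated calls are no-ops.
func (c *Collector) Stop() {
	c.stopOnce.Do(func() { close(c.stopCh) })
	<-c.done
}

func main() {
	passes := 0
	c := NewCollector(10*time.Millisecond, func() { passes++ })
	c.Start()
	time.Sleep(55 * time.Millisecond)
	c.Stop()
	c.Stop() // safe: sync.Once guards the channel close
	fmt.Println("at least one pass completed:", passes > 0)
}
```

Skipping a tick when TryLock fails keeps a slow metrics-server from piling up overlapping passes; whether the PR logs or counts skipped ticks is not stated here.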
Summary
Implements real-time instance resource usage monitoring by polling the Kubernetes metrics-server API. The instance_usage table already exists in the DB schema (created by the init migration) but had no implementation; this PR fills that gap.

What's New
Repository Layer
- InstanceUsageRepository with 5 methods: Create, GetLatestByInstanceID, ListByInstanceID, ListLatestPerInstance, DeleteOlderThan
- ensureTable() as an idempotent safety net (the table already exists in the migration)
- ListLatestPerInstance() uses a correlated subquery with MAX(recorded_at) for an efficient admin overview

Service Layer
- InstanceUsageService with input validation, 24h default / 30-day cap on history queries
- InstanceUsageCollector: background goroutine that periodically collects metrics for all running instances

Collector Design
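- Calls the metrics-server API via clientset.RESTClient().AbsPath("/apis/metrics.k8s.io/v1beta1/..."), so no new go.mod dependencies
- sync.Once Stop(), panic recovery, sync.Mutex.TryLock() overlap guard
- Collection interval configurable via the USAGE_COLLECT_INTERVAL env var (default: 60s)
- Uptime calculated from pod startTime
- GPU and disk metrics left as nil (requires DCGM / kubelet stats API; planned for Part 2)

API Endpoints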
- GET /instances/:id/usage/current (user-scoped)
- GET /instances/:id/usage/history?hours=24 (user-scoped)
- GET /admin/instances/usage/summary (admin-only)

Wiring
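- main.go: wires the repository, service, collector, handler, routes, and graceful shutdown

Testing
- 11 tests covering CPU/memory quantity parsing, service logic, and edge cases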
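To make the CPU/memory quantity parsing concrete, here is a minimal sketch of how metrics-server quantity strings such as "250m" or "512Mi" could be turned into the units this PR reports (CPU as a percentage of allocated cores, memory in GB). The function names parseCPUCores, parseMemoryBytes, cpuPercent, and bytesToGB are hypothetical, and the sketch assumes "GB" means 2^30 bytes; the PR's actual parser and unit convention may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuUnits maps CPU quantity suffixes to the divisor that yields whole cores.
var cpuUnits = []struct {
	suffix  string
	divisor float64
}{
	{"n", 1e9}, {"u", 1e6}, {"m", 1e3},
}

// memUnits maps memory quantity suffixes to a byte multiplier.
var memUnits = []struct {
	suffix string
	factor float64
}{
	{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}, {"Ti", 1 << 40},
	{"k", 1e3}, {"M", 1e6}, {"G", 1e9}, {"T", 1e12},
}

// parseCPUCores converts a CPU quantity ("250m", "1500000000n", "2") to cores.
func parseCPUCores(q string) (float64, error) {
	divisor := 1.0
	for _, u := range cpuUnits {
		if strings.HasSuffix(q, u.suffix) {
			q, divisor = strings.TrimSuffix(q, u.suffix), u.divisor
			break
		}
	}
	v, err := strconv.ParseFloat(q, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid CPU quantity %q: %w", q, err)
	}
	return v / divisor, nil
}

// parseMemoryBytes converts a memory quantity ("512Mi", "2G", "1024") to bytes.
func parseMemoryBytes(q string) (float64, error) {
	factor := 1.0
	for _, u := range memUnits {
		if strings.HasSuffix(q, u.suffix) {
			q, factor = strings.TrimSuffix(q, u.suffix), u.factor
			break
		}
	}
	v, err := strconv.ParseFloat(q, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid memory quantity %q: %w", q, err)
	}
	return v * factor, nil
}

// cpuPercent expresses used cores as a percentage of allocated cores.
// It returns 0 when nothing is allocated, to avoid dividing by zero.
func cpuPercent(usedCores, allocatedCores float64) float64 {
	if allocatedCores <= 0 {
		return 0
	}
	return usedCores / allocatedCores * 100
}

// bytesToGB assumes 1 GB = 2^30 bytes (an assumption of this sketch).
func bytesToGB(b float64) float64 { return b / (1 << 30) }

func main() {
	cores, _ := parseCPUCores("250m")
	mem, _ := parseMemoryBytes("512Mi")
	fmt.Println("cpu % of 2 allocated cores:", cpuPercent(cores, 2))
	fmt.Println("memory GB:", bytesToGB(mem))
}
```

In code that already depends on k8s.io/apimachinery, resource.ParseQuantity is the usual way to parse these strings; the hand-rolled version here only keeps the example free of external modules.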
Design Decisions
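- No k8s.io/metrics dependency: the metrics-server API is called directly through clientset.RESTClient().AbsPath()
- main.go wiring

Files Changed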
- repository/instance_usage_repository.go
- services/instance_usage_service.go
- services/instance_usage_collector.go
- handlers/instance_usage_handler.go
- services/instance_usage_service_test.go
- cmd/server/main.go
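As an illustration of the history-query rule mentioned under Service Layer (24h default, 30-day cap), a helper along the following lines could normalize the hours query parameter. The name normalizeHours and the choice to clamp oversized values rather than reject them are assumptions of this sketch; the PR only states that a default and a cap exist.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

const (
	defaultHistoryHours = 24      // used when ?hours= is absent
	maxHistoryHours     = 30 * 24 // 30-day cap
)

var errInvalidHours = errors.New("hours must be a positive integer")

// normalizeHours turns the raw ?hours= query value into a bounded window.
func normalizeHours(raw string) (int, error) {
	if raw == "" {
		return defaultHistoryHours, nil
	}
	h, err := strconv.Atoi(raw)
	if err != nil || h <= 0 {
		return 0, errInvalidHours
	}
	if h > maxHistoryHours {
		return maxHistoryHours, nil // assumption: clamp instead of returning an error
	}
	return h, nil
}

func main() {
	for _, raw := range []string{"", "48", "100000", "-1", "abc"} {
		h, err := normalizeHours(raw)
		fmt.Printf("hours=%q -> %d, err=%v\n", raw, h, err)
	}
}
```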