Add ResourceStatsDetail gRPC API#3435

Open
jumpei527 wants to merge 24 commits into main from
feature/gateway/add-resource-stats-detail

Conversation

@jumpei527
Contributor

@jumpei527 jumpei527 commented Dec 23, 2025

Description

#3243 implemented the ResourceStats API for future resource control. This PR adds ResourceStatsDetail API to easily aggregate ResourceStats data from all agents.

What Changed:

  1. Proto updates
  • Added rpc.v1.StatsDetail service to apis/proto/v1/rpc/stats/stats.proto.
  • Added ResourceStatsDetail(payload.v1.Empty) returns (payload.v1.Info.Stats.ResourceStatsDetail).
  • Updated payload types under Info.Stats for detail response (ResourceStatsDetail).
  2. Service registration and implementation
  • Implemented ResourceStatsDetail on gateway LB handler to aggregate ResourceStats from agents.
  • Implemented ResourceStatsDetail on NGT/Faiss agent handlers to return self-only detail.
  • Registered StatsDetail service in gateway and agent gRPC registration paths.
  3. E2E updates
  • Added resource_stats_detail operation support in E2E v2 config/binding/strategy routing.
  • Added dedicated stats execution path (stats_test.go).
  • Added ResourceStatsDetail scenarios to unary_crud.yaml, stream_crud.yaml, and multi_crud.yaml.
  • Verified this change with E2E v2 scenarios on a k3d-deployed Vald cluster.
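
The agent-side behaviour described above (each agent returns a detail response containing only itself) can be sketched roughly as follows. The `ResourceStats` and `ResourceStatsDetail` structs, the `selfDetail` helper, and the placeholder IP are hypothetical stand-ins for illustration, not the generated payload types or the PR's actual handler code.

```go
package main

import (
	"fmt"
	"os"
)

// ResourceStats is a hypothetical, trimmed-down stand-in for
// payload.v1.Info.Stats.ResourceStats.
type ResourceStats struct {
	Name string
	IP   string
}

// ResourceStatsDetail is a hypothetical stand-in for the detail response:
// a map of per-agent stats keyed by agent name.
type ResourceStatsDetail struct {
	Details map[string]*ResourceStats
}

// selfDetail wraps one agent's own stats in a single-entry detail map.
func selfDetail(name string, stats *ResourceStats) *ResourceStatsDetail {
	return &ResourceStatsDetail{
		Details: map[string]*ResourceStats{name: stats},
	}
}

func main() {
	// Use the hostname as the agent name; fall back to a fixed label.
	name, err := os.Hostname()
	if err != nil {
		name = "unknown-agent"
	}
	d := selfDetail(name, &ResourceStats{Name: name, IP: "127.0.0.1"})
	fmt.Println(len(d.Details), d.Details[name].IP)
}
```

Returning the same response shape from agents and from the gateway means a caller can treat a single agent as a one-entry cluster.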

Related Issue

#3274

Versions

  • Vald Version: v1.7.17
  • Go Version: v1.25.5
  • Rust Version: v1.92.0
  • Docker Version: v29.1.3
  • Kubernetes Version: v1.34.3
  • Helm Version: v4.0.4
  • NGT Version: v2.7.1
  • Faiss Version: v1.13.1

Checklist

Special notes for your reviewer

Summary by CodeRabbit

  • New Features

    • Added /resource/stats/detail API and StatsDetail gRPC service to fetch per-agent resource statistics (CPU/memory) and aggregated details.
  • Tests

    • Added end-to-end tests covering the new stats detail operations.
  • Chores

    • Bumped dependencies: tokio, anyhow, wincode.

@jumpei527 jumpei527 self-assigned this Dec 23, 2025
@coderabbitai
Contributor

coderabbitai bot commented Dec 23, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR restructures the resource statistics API by introducing a hierarchical Info.Stats container with nested ResourceStats, ResourceStatsDetail, and CgroupStats types. It adds a new StatsDetail RPC service, implements resource stats detail handlers across agent cores (FAISS/NGT) and the gateway load balancer, wires client support, updates documentation and Swagger, adds e2e tests, and refreshes Rust dependencies and generated bindings.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Proto & API docs: apis/proto/v1/payload/payload.proto, apis/proto/v1/rpc/stats/stats.proto, apis/docs/v1/payload.md.tmpl, apis/docs/v1/docs.md, apis/swagger/v1/rpc/stats/stats.swagger.json | Introduce Info.Stats with nested ResourceStats, ResourceStatsDetail, and CgroupStats; move previous ResourceStats into Info.Stats; add StatsDetail service and ResourceStatsDetail RPC and update docs/Swagger accordingly. |
| Vald client plumbing: apis/grpc/v1/vald/vald.go, internal/client/v1/client/vald/vald.go | Embed stats.StatsClient into Vald client interface/struct; instantiate stats client; add client ResourceStats method (trace + RoundRobin). |
| Agent handlers (FAISS): pkg/agent/core/faiss/handler/grpc/handler.go, pkg/agent/core/faiss/handler/grpc/stats.go, pkg/agent/core/faiss/usecase/agentd.go | Embed stats.StatsDetailServer, register service, implement ResourceStatsDetail handler that calls GetResourceStats and returns a single-entry details map keyed by agent name. |
| Agent handlers (NGT): pkg/agent/core/ngt/handler/grpc/handler.go, pkg/agent/core/ngt/handler/grpc/stats.go, pkg/agent/core/ngt/usecase/agentd.go | Same as FAISS: embed StatsDetailServer, register service, implement ResourceStatsDetail handler returning agent-specific stats. |
| Gateway LB: pkg/gateway/lb/handler/grpc/handler.go, pkg/gateway/lb/handler/grpc/stats.go, pkg/gateway/lb/usecase/vald.go | Extend server interface to include StatsDetailServer, change constructor return type, register service, implement ResourceStatsDetail that concurrently broadcasts ResourceStats to agents, aggregates per-agent details, records per-call spans/statuses, and surfaces errors with resource metadata. |
| Internal stats util & tests: internal/net/grpc/stats/stats.go, internal/net/grpc/stats/stats_test.go | Change return types from Info_ResourceStats to Info_Stats_ResourceStats; add GetResourceStats(ctx) helper building hostname/IP and optional cgroup metrics; update tests to new types. |
| E2E tests & runner: tests/v2/e2e/config/enums.go, tests/v2/e2e/config/config.go, tests/v2/e2e/crud/strategy_test.go, tests/v2/e2e/crud/stats_test.go | Add OpResourceStatsDetail operation, wire string aliases, noop bind handling, add stats_test.go with resourceStatsDetail helper and processStats dispatcher, integrate operation into strategy execution flow. |
| Rust protobuf codegen: rust/libs/proto/src/payload/v1/payload.v1.rs | Regenerate Rust bindings to reflect nested Info::Stats types (ResourceStats, ResourceStatsDetail, CgroupStats) and update full_name/type_url entries. |
| Rust deps: rust/libs/kvs/Cargo.toml, rust/libs/observability/Cargo.toml, rust/libs/vqueue/Cargo.toml | Minor dependency bumps (tokio 1.49→1.50, wincode 0.4.4→0.4.5, anyhow 1.0.101→1.0.102). |
| Gateway index handler: pkg/gateway/lb/handler/grpc/index.go | Fix: separate RPC error variable (callErr) from context/local err, ensuring correct error classification, status parsing, and span recording. |
| Misc: dockers/agent/core/agent/Dockerfile | Run cargo clean before building Rust release binary. |
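
The gateway index handler fix listed above keeps the RPC result in its own variable (callErr) so it cannot be confused with a context or local err. A minimal sketch of that idea follows; the `classify` and `run` helpers, the error value, and the classification strings are assumptions made for illustration and do not reproduce the actual code in index.go.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errAgentUnavailable is a hypothetical RPC failure used for illustration.
var errAgentUnavailable = errors.New("agent unavailable")

// classify performs one call and reports which error source applies.
// The RPC error and the context error live in separate variables, so
// reading one never overwrites the other.
func classify(ctx context.Context, call func(context.Context) error) string {
	callErr := call(ctx) // result of the RPC itself
	err := ctx.Err()     // context/local error, tracked separately
	switch {
	case callErr == nil:
		return "ok"
	case err != nil:
		return "context: " + err.Error()
	default:
		return "rpc: " + callErr.Error()
	}
}

// run builds a context (optionally already cancelled) and classifies one call.
func run(cancelled, fail bool) string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelled {
		cancel()
	}
	return classify(ctx, func(context.Context) error {
		if fail {
			return errAgentUnavailable
		}
		return nil
	})
}

func main() {
	fmt.Println(run(false, false))
	fmt.Println(run(false, true))
	fmt.Println(run(true, true))
}
```

With a single shared err variable, the later assignment from the context would have replaced the RPC error before it was classified; two names avoid that.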

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant GatewayLB as "Gateway LB"
    participant Agent1 as "Agent 1"
    participant Agent2 as "Agent 2"
    participant AgentN as "Agent N"
    participant StatsUtil as "Stats Util"

    Client->>GatewayLB: ResourceStatsDetail()
    GatewayLB->>GatewayLB: start trace span\ninit Details map

    par broadcast
        GatewayLB->>Agent1: ResourceStats() [sub-span]
        Agent1->>StatsUtil: GetResourceStats(ctx)
        StatsUtil-->>Agent1: stats (hostname, ip, cgroup)
        Agent1-->>GatewayLB: ResourceStats response -> Details[agent1]

        GatewayLB->>Agent2: ResourceStats() [sub-span]
        Agent2->>StatsUtil: GetResourceStats(ctx)
        StatsUtil-->>Agent2: stats
        Agent2-->>GatewayLB: Details[agent2]

        GatewayLB->>AgentN: ResourceStats() [sub-span]
        AgentN->>StatsUtil: GetResourceStats(ctx)
        StatsUtil-->>AgentN: stats
        AgentN-->>GatewayLB: Details[agentN]
    end

    GatewayLB->>GatewayLB: aggregate Details\nrecord per-call statuses
    GatewayLB-->>Client: ResourceStatsDetail{ details: map[...] }
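
The broadcast-and-aggregate flow in the diagram can be sketched in Go roughly as below. This is a simplified illustration under stated assumptions: `agentFunc`, `aggregate`, `demo`, and the struct types are hypothetical stand-ins, tracing spans are omitted, and the real gateway uses its own client and broadcast helpers rather than raw goroutines.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// ResourceStats and ResourceStatsDetail are hypothetical stand-ins for
// the generated payload types.
type ResourceStats struct {
	Name string
}

type ResourceStatsDetail struct {
	Details map[string]*ResourceStats
}

// agentFunc is a hypothetical stand-in for one agent's ResourceStats RPC.
type agentFunc func(ctx context.Context) (*ResourceStats, error)

// aggregate calls every agent concurrently, stores successful responses
// in one detail map, and joins per-agent failures into a single error.
func aggregate(ctx context.Context, agents map[string]agentFunc) (*ResourceStatsDetail, error) {
	var (
		mu   sync.Mutex // guards detail.Details and errs
		wg   sync.WaitGroup
		errs []error
	)
	detail := &ResourceStatsDetail{Details: make(map[string]*ResourceStats, len(agents))}
	for name, call := range agents {
		wg.Add(1)
		go func(name string, call agentFunc) {
			defer wg.Done()
			stats, err := call(ctx)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = append(errs, fmt.Errorf("agent %s: %w", name, err))
				return
			}
			detail.Details[name] = stats
		}(name, call)
	}
	wg.Wait()
	// errors.Join returns nil when there are no errors to join.
	return detail, errors.Join(errs...)
}

// demo runs aggregate against two healthy fake agents and one failing one.
func demo() (int, error) {
	ok := func(name string) agentFunc {
		return func(context.Context) (*ResourceStats, error) {
			return &ResourceStats{Name: name}, nil
		}
	}
	bad := func(context.Context) (*ResourceStats, error) {
		return nil, errors.New("unreachable")
	}
	d, err := aggregate(context.Background(), map[string]agentFunc{
		"agent-1": ok("agent-1"),
		"agent-2": ok("agent-2"),
		"agent-n": bad,
	})
	return len(d.Details), err
}

func main() {
	n, err := demo()
	fmt.Println(n, err)
}
```

Returning the partial map together with the joined error is one possible design; whether the real handler fails the whole call or returns partial results is not stated in this thread.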

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs

Suggested reviewers

  • Matts966
  • datelier
  • kpango

Poem

📊 Agents whisper metrics near and far,
Gateways gather echoes like a chart;
Details map each host, each little star,
Traces tie the pieces — whole, not part. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add ResourceStatsDetail gRPC API' accurately reflects the primary objective of this changeset, which introduces a new gRPC service and proto definitions for ResourceStatsDetail across agents and gateway. |

@vdaas-ci
Collaborator

[CHATOPS:HELP] ChatOps commands.

  • 🙆‍♀️ /approve - approve
  • 🍱 /format - format codes and add licenses
  • /gen-test - generate test codes
  • 🏷️ /label - add labels
  • 🔚 2️⃣ 🔚 /label actions/e2e-deploy - run E2E deploy & integration test

@codecov

codecov bot commented Dec 23, 2025

Codecov Report

❌ Patch coverage is 2.80899% with 173 lines in your changes missing coverage. Please review.
✅ Project coverage is 25.97%. Comparing base (70de684) to head (63e5986).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/gateway/lb/handler/grpc/stats.go | 0.00% | 100 Missing ⚠️ |
| pkg/gateway/lb/handler/grpc/index.go | 0.00% | 22 Missing ⚠️ |
| internal/client/v1/client/vald/vald.go | 0.00% | 20 Missing ⚠️ |
| pkg/agent/core/faiss/handler/grpc/stats.go | 0.00% | 13 Missing ⚠️ |
| pkg/agent/core/ngt/handler/grpc/stats.go | 0.00% | 13 Missing ⚠️ |
| apis/grpc/v1/vald/vald.go | 0.00% | 1 Missing ⚠️ |
| pkg/agent/core/faiss/usecase/agentd.go | 0.00% | 1 Missing ⚠️ |
| pkg/agent/core/ngt/usecase/agentd.go | 0.00% | 1 Missing ⚠️ |
| pkg/gateway/lb/handler/grpc/handler.go | 0.00% | 1 Missing ⚠️ |
| pkg/gateway/lb/usecase/vald.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3435      +/-   ##
==========================================
+ Coverage   25.02%   25.97%   +0.94%     
==========================================
  Files         544      572      +28     
  Lines       43800    50475    +6675     
==========================================
+ Hits        10963    13109    +2146     
- Misses      32108    36395    +4287     
- Partials      729      971     +242     

@jumpei527 jumpei527 force-pushed the feature/gateway/add-resource-stats-detail branch from c3df358 to 2df183b Compare January 16, 2026 14:19
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Jan 16, 2026

Deploying vald with Cloudflare Pages

Latest commit: 63e5986
Status: ✅  Deploy successful!
Preview URL: https://1d6d77f5.vald.pages.dev
Branch Preview URL: https://feature-gateway-add-resource.vald.pages.dev

@github-actions github-actions bot added size/XL and removed size/XL labels Jan 16, 2026
@jumpei527 jumpei527 added the actions/build/arm64 Build ARM64 Docker Images label Jan 22, 2026
@kpango
Collaborator

kpango commented Mar 9, 2026

/format

@vdaas-ci
Collaborator

vdaas-ci commented Mar 9, 2026

[FORMAT] Updating license headers and formatting go codes triggered by kpango.

Signed-off-by: Vdaas CI <vald@vdaas.org>
@github-actions github-actions bot added size/XXL and removed size/XXL labels Mar 9, 2026
kmrmt
kmrmt previously approved these changes Mar 9, 2026
Signed-off-by: Jumpei Tanikawa <jumpei5278@gmail.com>
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
apis/proto/v1/payload/payload.proto (1)

654-678: ⚠️ Potential issue | 🟠 Major

Still unresolved: preserve Info.ResourceStats on the v1 surface.

Rehoming this message under Info.Stats renames the generated SDK type for the existing stats RPC (Info_ResourceStats → Info_Stats_ResourceStats in Go), which is a source-breaking change for current v1 clients. Keep the legacy top-level message and hang the new detail shape off it instead.

You can verify the generated surface change with:

#!/bin/bash
set -euo pipefail

rg -n -C2 'type Info_ResourceStats|type Info_Stats_ResourceStats' apis/grpc/v1/payload/payload.pb.go
rg -n -C2 '\bResourceStats\b' apis/proto/v1/rpc/stats/stats.proto
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apis/proto/v1/payload/payload.proto` around lines 654 - 678, The change moved
the ResourceStats protobuf message under Info.Stats which renames the generated
Go type (Info_ResourceStats → Info_Stats_ResourceStats) and breaks v1 clients;
restore the original top-level message name and keep the new detailed shape as a
nested type. Specifically, reintroduce a top-level message ResourceStats with
the same fields (cpu_limit_cores, cpu_usage_cores, memory_limit_bytes,
memory_usage_bytes and the name/ip fields) so the generated type
Info_ResourceStats remains, and modify Stats (or Info.Stats) to reference that
top-level ResourceStats for existing usages while adding a new
ResourceStatsDetail (or similar nested message) that can contain the map<string,
ResourceStats> for the detailed shape.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 61a63792-cabc-46bf-a663-a296fdcccb75

📥 Commits

Reviewing files that changed from the base of the PR and between 5f93be8 and b8e4ac2.

⛔ Files ignored due to path filters (3)
  • apis/grpc/v1/payload/payload.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/payload/payload_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go, !**/*_vtproto.pb.go
  • rust/libs/proto/src/payload/v1/payload.v1.serde.rs is excluded by !**/*.serde.rs
📒 Files selected for processing (5)
  • apis/docs/v1/docs.md
  • apis/docs/v1/payload.md.tmpl
  • apis/proto/v1/payload/payload.proto
  • pkg/gateway/lb/handler/grpc/index.go
  • rust/libs/proto/src/payload/v1/payload.v1.rs

@jumpei527 jumpei527 requested a review from Matts966 March 17, 2026 13:12
This reverts commit 628704b.
Member

IndexStatistics can panic when all agent responses have Valid=false. mergeInfoIndexStatistics skips invalid entries but still indexes into empty slices.

Contributor Author

@Matts966
I'll fix the empty slice issue.
As a related question, I noticed we are currently dividing by len(stats) to calculate the averages at the end. Since we are skipping the invalid entries, shouldn't we use the actual number of valid entries as the denominator instead?
Please let me know if I should include this fix as well.

Member

@jumpei527
I think you are correct. Please use the actual number of valid entries.
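
The two fixes agreed on in this thread (guard against having no valid entries, and divide by the number of valid entries rather than len(stats)) can be illustrated with a small sketch. The `indexStats` struct, its `Value` field, and the `mergeMean` helper are hypothetical; the real mergeInfoIndexStatistics merges many more fields.

```go
package main

import "fmt"

// indexStats is a hypothetical, trimmed-down stand-in for one agent's
// index statistics response.
type indexStats struct {
	Valid bool
	Value float64
}

// mergeMean averages Value over valid entries only.
// ok is false when there is no valid entry, so callers never index into
// or divide by an empty set.
func mergeMean(stats []indexStats) (mean float64, ok bool) {
	var sum float64
	var valid int
	for _, s := range stats {
		if !s.Valid {
			continue // skipped entries must not count toward the denominator
		}
		sum += s.Value
		valid++
	}
	if valid == 0 {
		return 0, false
	}
	return sum / float64(valid), true
}

func main() {
	mean, ok := mergeMean([]indexStats{
		{Valid: true, Value: 2},
		{Valid: false, Value: 100},
		{Valid: true, Value: 4},
	})
	fmt.Println(mean, ok)
}
```

Dividing by len(stats) in this example would give 2 instead of 3, because the invalid entry would still be counted in the denominator.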

@vdaas-ci
Collaborator

Profile Report

[Profile table: rows cpu and heap; columns vald-agent-ngt, vald-lb-gateway, vald-discoverer, vald-manager-index; profile images not preserved]
other images

5 participants