
Fix gRPC broadcast latency and tune connection pool#3478

Closed
kpango wants to merge 1 commit into main from
fix-grpc-broadcast-latency-17105619334351837761

Conversation

@kpango
Collaborator

@kpango kpango commented Feb 18, 2026

This pull request primarily updates dependencies and environment versions used in CI/CD workflows, issue templates, and adds new benchmark and configuration files to the codebase. The changes are focused on keeping the project up-to-date with the latest tools, improving test coverage, and ensuring accurate environment information.

Dependency and workflow updates:

  • Upgraded several GitHub Actions to their latest versions, including actions/upload-artifact, actions/download-artifact, shogo82148/actions-upload-release-asset, crazy-max/ghaction-import-gpg, and github/codeql-action across multiple workflow files for improved reliability and security.

Environment version updates:

  • Updated Go, Rust, Docker, Kubernetes, Helm, NGT, and Faiss versions in issue templates and the pull request template to reflect the latest releases and ensure accurate environment reporting.

Test and benchmark coverage improvements:

  • Added new benchmark test files: internal/net/grpc/benchmark_test.go and internal/net/grpc/pool/pool_bench_gethealthy_test.go to enhance performance testing of gRPC components.

File and configuration additions:

  • Introduced rust/bin/meta/src/config.rs to the Rust meta binary, likely for improved configuration management.

Build trigger enhancements:

  • Added internal/sync/singleflight/*.go to multiple workflow path filters so CI runs when the singleflight package changes.

@google-labs-jules

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Feb 18, 2026

Deploying vald with Cloudflare Pages

Latest commit: 326a4a2
Status: ✅  Deploy successful!
Preview URL: https://8c7b037b.vald.pages.dev
Branch Preview URL: https://fix-grpc-broadcast-latency-1.vald.pages.dev


@vdaas-ci
Collaborator

[CHATOPS:HELP] ChatOps commands.

  • 🙆‍♀️ /approve - approve
  • 🍱 /format - format codes and add licenses
  • /gen-test - generate test codes
  • 🏷️ /label - add labels
  • /label actions/e2e-deploy - run E2E deploy & integration test

@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds per-address singleflight synchronization to the gRPC client, refactors connection-pool health/refresh into async flows using gRPC connectivity states with optional metrics, increases default pool sizing and keepalive, updates gateway broadcast concurrency, adds benchmarks, expands CI triggers for singleflight, and bumps several Rust deps.

Changes

  • gRPC client core — internal/net/grpc/client.go: Adds per-address singleflight.Group and an enablePoolMetrics flag; serializes Connect/Disconnect via group.Do; refactors connect/reconnect, pool creation, and RangeConcurrent behavior.
  • Connection pool — internal/net/grpc/pool/pool.go, internal/net/grpc/pool/option.go: Adds an enableMetrics option; changes health checks to use gRPC connectivity states; introduces async refreshConn via errgroup and altered dial/refresh semantics.
  • gRPC options & config — internal/net/grpc/option.go, internal/config/grpc.go, charts/vald/values.yaml: Increases the default pool size, adds keepalive params, and exposes WithEnableConnectionPoolMetrics and the ConnectionPool.EnableMetrics config flag.
  • Benchmarks & tests — internal/net/grpc/benchmark_test.go, internal/net/grpc/pool/pool_bench_test.go: Adds a bufconn-based broadcast benchmark and a high-contention pool benchmark.
  • Gateway behavior — pkg/gateway/lb/service/gateway.go: Broadcast now uses dynamic concurrency via GetAgentCount(fctx) instead of a fixed -1.
  • CI workflows — .github/workflows/...: Adds internal/sync/singleflight/*.go to multiple workflow path filters so CI triggers on singleflight package changes.
  • Config & charts — charts/vald/values.yaml: Adds defaults.grpc.client.connection_pool.enable_metrics, a boolean defaulting to false.
  • Rust deps — rust/libs/*/Cargo.toml: Bumps several Rust dependency versions (e.g., tonic, futures-core, wincode).
  • New tests file — internal/net/grpc/benchmark_test.go: Introduces a bufconn-based mock broadcast server and BenchmarkBroadcast.

Sequence Diagram(s)

sequenceDiagram
    participant Client as gRPC Client
    participant SF as singleflight (per-addr)
    participant Pool as Connection Pool
    participant Server as Remote gRPC Server

    Client->>SF: Connect(addr) [group.Do]
    SF->>Pool: ensure/create/upgrade pool
    Pool->>Pool: inspect connectivity state (Ready/Idle/Connecting/TransientFailure/Shutdown)
    alt Idle
        Pool->>Pool: trigger async refresh/dial (errgroup)
        Pool-->>SF: supply connection when ready
    else Ready or Connecting
        Pool-->>SF: return connection
    else TransientFailure
        Pool-->>SF: return last-resort conn and schedule refresh
    else Shutdown
        Pool->>Pool: async dial new conn
    end
    SF-->>Client: connection returned
    Client->>Server: RPCs via chosen connection
    Client->>SF: Disconnect(addr) [group.Do]
    SF->>Pool: serialized teardown/replace
    Pool->>Pool: close/replace slots

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

type/feature, type/dependency

Suggested reviewers

  • kmrmt
  • Matts966
  • datelier

Poem

Singleflight queues one call in line,
Pools wake and sleep by connectivity sign.
Bufconn hums benchmarks in-memory bright,
CI learns to watch the singleflight.
Metrics sleep off — now ready to shine.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title Check — ✅ Passed: the title directly captures the two main objectives: fixing gRPC broadcast latency and tuning the connection pool configuration.
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-grpc-broadcast-latency-17105619334351837761

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov bot commented Feb 18, 2026

Codecov Report

❌ Patch coverage is 15.29052% with 277 lines in your changes missing coverage. Please review.
✅ Project coverage is 26.31%. Comparing base (70de684) to head (1fade9f).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
internal/net/grpc/pool/pool.go 0.00% 137 Missing ⚠️
internal/net/grpc/client.go 0.00% 122 Missing ⚠️
hack/helm/schema/crd/main.go 0.00% 6 Missing ⚠️
internal/servers/server/server.go 50.00% 4 Missing and 1 partial ⚠️
internal/net/grpc/option.go 0.00% 3 Missing ⚠️
internal/net/grpc/pool/option.go 33.33% 1 Missing and 1 partial ⚠️
hack/benchmark/internal/core/algorithm/ngt/ngt.go 0.00% 1 Missing ⚠️
hack/license/gen/main.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3478      +/-   ##
==========================================
+ Coverage   25.02%   26.31%   +1.28%     
==========================================
  Files         544      571      +27     
  Lines       43800    50472    +6672     
==========================================
+ Hits        10963    13283    +2320     
- Misses      32108    36209    +4101     
- Partials      729      980     +251     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around line 49-86: Move the setup of the gRPCClient, population of
client.conns (mockPool entries) and atomic.StoreUint64(&client.clientCount, ...)
so it runs before starting the benchmark timers; then call b.ReportAllocs() and
b.ResetTimer() once before the sub-benchmarks in BenchmarkBroadcast. Inside each
sub-benchmark replace the old for i := 0; i < b.N; i++ loops with the Go 1.24
style for b.Loop() loop, and keep the call to client.RangeConcurrent(...) and
its simulated latency inside that loop. Ensure you reference the
BenchmarkBroadcast function, the gRPCClient instance (client),
client.conns/mockPool population, and the RangeConcurrent call when making these
changes.
- Around line 36-47: The mockPool currently hands out zero-value grpc.ClientConn
structs (see mockPool.Do, mockPool.Get, mockPool.Reconnect) which violates gRPC
construction rules; replace those returns with real, properly-initialized
*grpc.ClientConn instances created by dialing a test server (use net.Listen +
grpc.NewServer or an in-process bufconn as done in pool_bench_test.go), update
the benchmark setup to start and tear down the test gRPC server and dial a
client connection once, and have mockPool methods return that *grpc.ClientConn
instead of &grpc.ClientConn{} so any later method calls on the connection are
safe.

In `@internal/net/grpc/option.go`:
- Line 43: The change increasing the connection pool to
WithConnectionPoolSize(10) can multiply persistent gRPC connections (e.g., 100
agents → 1,000 connections) and risk hitting OS file descriptor and memory
limits; update the configuration and docs to make the pool size safe and
configurable, validate/adjust WithIdleTimeout to reclaim idle connections, and
add a checklist/test to verify target nodes' ulimit -n and pod memory limits for
the expected scale. Specifically, ensure WithConnectionPoolSize is
documented/configurable (or reduced), tune WithIdleTimeout to a shorter reclaim
window, add runtime validation or warnings if computed total FDs may exceed
ulimit, and include guidance in deployment manifests about required pod memory
and ulimit values.
- Around line 43-44: The hardcoded client keepalive in WithKeepaliveParams
currently sets PermitWithoutStream=true which will cause servers (with default
permit_without_stream=false and min_time=10m) to GOAWAY; update the
WithKeepaliveParams usage in internal/net/grpc/option.go to use
PermitWithoutStream=false (or remove/disable the aggressive 30s keepalive) so
client pings won’t be sent without active streams, or alternatively coordinate
server-side config changes before landing; change the WithKeepaliveParams call
accordingly to prevent immediate GOAWAYs from Vald servers.
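A hedged sketch of the client parameters the comments above call for, aligned with a server enforcing min_time: 10m and permit_without_stream: false. This is configuration guidance, not the PR's actual diff; field names are from google.golang.org/grpc/keepalive.ClientParameters.

```go
// Keepalive no more aggressive than the server permits: pings are sent
// at most every 10 minutes and never without an active stream, so the
// server's keepalive enforcement policy won't answer with GOAWAY.
kp := keepalive.ClientParameters{
	Time:                10 * time.Minute, // >= the server's min_time
	Timeout:             30 * time.Second, // wait this long for a ping ack
	PermitWithoutStream: false,            // no pings while no RPCs are active
}
conn, err := grpc.NewClient(addr, grpc.WithKeepaliveParams(kp))
```

If the 30s interval is genuinely needed, the server-side enforcement policy has to be relaxed in the same change set, not after the client lands.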

In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around line 237-246: The benchmark currently runs measurement without
bracketing timers or reporting allocations; update the benchmark surrounding the
b.RunParallel(...) loop to match the other benchmarks by calling b.StopTimer(),
b.ReportAllocs(), b.ResetTimer(), and b.StartTimer() before entering the
b.RunParallel block (so pool setup and connection time are excluded) and call
b.StopTimer() immediately after the parallel loop; modify the same benchmark
that invokes p.Do(...) on *ClientConn to include these calls.
- Around line 212-216: Remove the stale development-commentary block that spans
the scratch notes referencing "pool_bench_test.go" and the
ListenAndServe/pool.New ideas (the comment lines between the existing test setup
and helper usage); simply delete those lines so the test file contains only
meaningful comments and code around the test setup (where ListenAndServe and
pool.New are used) to keep the test tidy.

In `@pkg/gateway/lb/service/gateway.go`:
- Line 91: The tracing span was started on fctx but GetAgentCount was called
with the stale ctx; change the call to use the traced context (call
g.GetAgentCount(fctx)) so the address lookup is attributed to the BroadCast
trace span, and ensure any other calls inside the same return expression use
fctx consistently (e.g., client.RangeConcurrent(fctx, g.GetAgentCount(fctx),
...)).

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/net/grpc/pool/pool.go (2)

348-350: 🧹 Nitpick | 🔵 Trivial

Dead code: count == 0 is unreachable after the sz == 0 guard

The early-return at lines 328–330 guarantees sz > 0 before the loop. Since every iteration either increments count (in the default branch) or returns early via ctx.Done(), count is always ≥ 1 when execution reaches line 348. The block can never trigger.

🧹 Proposed cleanup
  }
- if count == 0 {
- 	return errors.ErrGRPCPoolConnectionNotFound
- }
  return nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/net/grpc/pool/pool.go` around lines 348 - 350, Remove the
unreachable dead branch that checks `if count == 0 { return
errors.ErrGRPCPoolConnectionNotFound }` in the function that iterates over the
pool (the loop guarded by `sz == 0` and using `count` and `ctx.Done()`). Because
`sz > 0` is ensured before the loop and every loop path either increments
`count` or returns on `ctx.Done()`, delete this `count == 0` check to clean up
dead code and leave the remaining return paths intact.

287-300: ⚠️ Potential issue | 🟡 Minor

Remove load() method — confirmed unused and contains bounds-checking bug

The method is dead code: no active calls exist in the codebase, and its test is commented out. More critically, the bounds check at line 294 is inverted: if sz < idx should be if idx >= sz. If that condition evaluates true, line 295 executes slots[sz].Load() where sz == len(slots), causing an out-of-bounds panic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/net/grpc/pool/pool.go` around lines 287 - 300, Remove the dead and
unsafe method pool.load: it is unused and contains an inverted bounds check that
can index slots[len(slots)] causing an OOB panic; delete the load(idx uint64)
(ridx uint64, pc *poolConn) implementation from the pool type and any commented
tests referencing it, and run tests to confirm nothing else depends on the
function name load in type pool (search for method name "load" and type "pool"
to ensure no remaining callers).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around line 211-213: Replace the b.Fatal(err) call after p.Connect(ctx) with
b.Error(err) so the benchmark follows the file-wide convention used in
Benchmark_ConnPool, BenchmarkParallel_ConnPool and ListenAndServe; keep the
p.Connect(ctx) error path but use b.Error(err) (and ensure any subsequent code
remains safe if p is nil).

---

Outside diff comments:
In `@internal/net/grpc/pool/pool.go`:
- Around line 348-350: Remove the unreachable dead branch that checks `if count
== 0 { return errors.ErrGRPCPoolConnectionNotFound }` in the function that
iterates over the pool (the loop guarded by `sz == 0` and using `count` and
`ctx.Done()`). Because `sz > 0` is ensured before the loop and every loop path
either increments `count` or returns on `ctx.Done()`, delete this `count == 0`
check to clean up dead code and leave the remaining return paths intact.
- Around line 287-300: Remove the dead and unsafe method pool.load: it is unused
and contains an inverted bounds check that can index slots[len(slots)] causing
an OOB panic; delete the load(idx uint64) (ridx uint64, pc *poolConn)
implementation from the pool type and any commented tests referencing it, and
run tests to confirm nothing else depends on the function name load in type pool
(search for method name "load" and type "pool" to ensure no remaining callers).

---

Duplicate comments:
In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around line 216-226: The benchmark BenchmarkPool_HighContention is missing the
standard measurement bracketing and is including setup latency and allocations;
before entering the parallel loop around p.Do(...) call stop the timer and call
b.ReportAllocs() then call b.StartTimer(), and after the parallel RunParallel
returns call b.StopTimer(); ensure you place these calls in the same pattern
used by the other benchmarks in this file so the allocation and setup overhead
are excluded from the measured loop.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/e2e_bench_test.go`:
- Around line 144-146: The code currently logs a non-fatal warning when
connectivity is partial (the time.Since(start) > 5*time.Second branch) which
allows the benchmark to continue with fewer than numAgents; change this to fail
the benchmark immediately by replacing the warning log with a fatal failure (use
b.Fatalf or equivalent) so the benchmark halts when connected < numAgents after
the timeout; update the branch that references start, connected, and numAgents
in e2e_bench_test.go to call b.Fatalf with a clear message including connected
and numAgents.
- Around line 88-101: startBenchServer currently calls b.Fatalf on error which
panics when invoked from goroutines; change startBenchServer to return (addr
string, td func(), err error) instead of calling b.Fatalf, update the goroutine
to capture the returned err into a shared errors slice (protected by setupMu)
along with addrs and teardowns, and after setupWg.Wait() check the collected
errors and call b.Fatalf (or b.Fatalf with combined error info) from the
benchmark goroutine if any error occurred; update any callers of
startBenchServer to handle the new error return.
- Line 178: The test currently calls client.Close(ctx) at the end (client.Close)
which leaks connections if a sub-benchmark panics; after creating the client
(the variable named client in e2e_bench_test.go) immediately call defer
client.Close(ctx) so the connection is always cleaned up on panic or early
return, and remove the standalone client.Close(ctx) call at line 178.
- Around line 59-72: The error channel ech created for srv.ListenAndServe is
never read so startup failures are ignored; change the logic to drain/check ech
to detect bind/startup errors by launching a background goroutine that reads
from ech and records any error (or use a non-blocking select/read after the
sleep) and fail the test/benchmark if an error is received; also replace the
fragile time.Sleep with a readiness check or by waiting for either a successful
"started" signal or an error from ech (use the existing ctx/cancel and wg to
coordinate shutdown).
- Around line 83-84: The benchmark silently swallows port conflicts because it
uses hardcoded ports and discards srv.ListenAndServe() errors; fix by adding a
WithListener(server.New option) to accept a pre-bound net.Listener and a method
like BoundAddr()/Addr() on the server to retrieve the actual bound address, then
in the test do l, _ := net.Listen("tcp", ":0"), pass l via WithListener to
server.New, obtain the bound address via BoundAddr()/Addr() and use that for
agents instead of fixed ports; additionally capture srv.ListenAndServe() errors
by sending them on a startup error channel and fail fast (or log and abort) if
no successful start within the poll timeout instead of ignoring the error, or
alternatively use IsConnected() polling against the server address to ensure all
agents are reachable before proceeding.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch 2 times, most recently from 7e7dd4a to 6fd320d on February 19, 2026 at 10:02
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around line 66-77: startBroadcastServer currently launches srv.ListenAndServe
in a goroutine and returns immediately, causing the benchmark to busy-spin
retrying connections; change startBroadcastServer to provide a readiness signal
and have the caller wait on it (e.g., reuse the existing ech channel: have
ListenAndServe send a ready/started value or close a dedicated ready channel
once listening) or use a sync.WaitGroup to block until the server is accepting
connections before returning; alternatively, add a small sleep/backoff in the
retry loop that uses addr to avoid tight CPU spinning—update references to
ListenAndServe, ech, startBroadcastServer, and the caller's retry loop to use
the chosen readiness mechanism.
- Around line 44-77: The code has a TOCTOU race: you call net.Listen to get a
free port (l), close it, then create a server that re-binds later (risking
collisions). Modify startBroadcastServer to avoid closing the pre-bound
listener: either (1) change server.New / server.ListenAndServe to accept a
net.Listener (pass l through a new option like server.WithListener or a
constructor overload) and register gRPC on that listener inside
WithGRPCRegisterar, or (2) use an in-memory bufconn listener for the test
(bufconn.Listen) and pass that into the server, or if server supports it use
server.WithHost("localhost") + server.WithPort(0) so the server itself binds an
ephemeral port; if none are possible add a short retry loop when binding. Update
code references: net.Listen (l), server.New, server.WithGRPCRegisterar,
srv.ListenAndServe, and srv.Shutdown to operate on the pre-bound listener
instead of closing l then rebinding.
- Around line 114-127: The retry loop that calls client.Connect in the goroutine
spins without any backoff and uses the standard sync.WaitGroup; change the
goroutine in the benchmark to (1) replace the stdlib WaitGroup with Vald's
internal/sync WaitGroup type (from package "internal/sync") used throughout the
repo, and (2) add a small backoff/sleep between Connect attempts inside the for
i := 0; i < 10; i++ loop (e.g., time.Sleep with a short duration or an
exponential/backoff strategy) so Connect is not hammered; keep the
atomic.AddInt64(&connectedCount, 1) and return behavior unchanged and ensure the
new WaitGroup variable (wg) is the internal/sync version so wg.Done() and
wg.Wait() continue to work.

In `@internal/net/grpc/pool/pool.go`:
- Around line 587-599: Remove the development scratch comments inside the dial
logic: delete the stray internal reasoning lines between the healthy check and
the state handling (the comments mentioning "But NewClient returns Idle
usually..." and "isHealthy returns true for Idle..."). Keep the conditional
behavior intact (use p.isHealthy(idx, conn), inspect conn.GetState() and the
connectivity.TransientFailure branch) but tidy the block so only concise,
production-ready comments remain or no comment at all.
- Around line 655-658: The Debugf call in the connectivity.TransientFailure case
of pool.go has four format verbs but only three arguments (pc.addr, idx+1,
state.String()); add the missing pool-size argument to match the third %d (e.g.,
use the pool size field or computed length such as pc.poolSize or len(pc.conns))
and ensure arguments are ordered as (pc.addr, idx+1, <pool-size>,
state.String()) so the format string and arguments align.
- Around line 665-681: The second-pass loop currently calls p.refreshConn
synchronously (which calls p.dial / grpc.NewClient) and can block request
latency; change the logic in the loop that iterates over slots (where load, pc,
pc.conn and connectivity states are inspected) to not synchronously call
refreshConn but instead schedule the refresh asynchronously via
p.errGroup.Go(func(){ _ = p.refreshConn(ctx, idx, pc, p.addr) }) and immediately
continue/return an error to the caller when no usable connection was found on
the first pass; keep the existing behavior of returning a TransientFailure pc as
a last resort, but ensure refresh attempts are backgrounded on p.errGroup and
the caller is not blocked on network dial timeouts.
- Around line 897-901: The review notes that treating
connectivity.TransientFailure as "healthy" is intentional but non-obvious;
update the code to document this design decision: add a clear comment near the
IsHealthy implementation (and the getHealthyConn function) explaining that
TransientFailure is considered healthy to allow gRPC's exponential backoff and
internal retries to recover, that getHealthyConn performs a two-pass selection
(preferring Ready/Idle/Connecting and only returning TransientFailure as a last
resort), and that Reconnect/Connect logic intentionally skips hard reconnects
for TransientFailure—reference the IsHealthy, getHealthyConn, Reconnect, and
Connect symbols so future readers understand the semantic shift from
"immediately usable" to "not permanently dead."

---

Duplicate comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around line 132-141: The benchmark currently swallows errors and uses a manual
for-loop; update the loop to use b.Loop() (as in pool_bench_test.go) and surface
failures instead of ignoring them: inside the RangeConcurrent call handler (and
after it) replace the commented-out b.Fatalf with either b.Fatalf to fail loudly
on error or b.Error to record the failure while continuing, and ensure the outer
loop uses for b.Loop() rather than for i := 0; i < b.N; i++ so the benchmark
reports meaningful results; reference RangeConcurrent, b.Fatalf/b.Error, and
b.Loop when making the changes.

In `@internal/net/grpc/option.go`:
- Around line 43-44: The client keepalive settings used with WithKeepaliveParams
(ping interval "30s" and PermitWithoutStream true) violate the Vald server
enforcement (min_time: 10m, permit_without_stream: false) causing GOAWAYs;
update the grpc client options used in option.go (e.g., the call site
constructing WithKeepaliveParams and WithConnectionPoolSize) to set
PermitWithoutStream to false and increase the ping interval to at least the
server min_time (e.g., >=10m), or alternatively coordinate and document that
server enforcement vars (min_time, permit_without_stream) will be changed across
Vald before landing this PR.

In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around line 211-246: Remove the stale scratch comments at the start of
BenchmarkPool_HighContention (they add no value), and add benchmark hygiene:
call b.ReportAllocs() before starting measurements, wrap setup in b.StopTimer()
and b.StartTimer() so pool creation/Connect (New, Connect, ListenAndServe) are
excluded from timing, then ensure you call b.StopTimer() after the b.RunParallel
measurement completes; keep the test logic the same.

In `@pkg/gateway/lb/service/gateway.go`:
- Line 91: Call GetAgentCount using the tracing context fctx (not the original
ctx) so the span started on fctx is preserved; change the invocation in
gateway.go from g.GetAgentCount(ctx) to g.GetAgentCount(fctx) so RangeConcurrent
and GetAgentCount share the same trace context and maintain consistent trace
attribution.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch 3 times, most recently from 4a0c245 to 9a7f31a on February 19, 2026 at 10:17
@kpango
Collaborator Author

kpango commented Feb 19, 2026

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

✅ Actions performed

Full review triggered.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch from 9a7f31a to c1f0a9f on February 19, 2026 at 10:22
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around line 132-140: The benchmark currently ignores errors returned by
client.RangeConcurrent which can hide RPC failures; update the benchmark loop in
benchmark_test.go to check the err from client.RangeConcurrent and call b.Fatalf
(or b.Error/b.FailNow) with a descriptive message including err to fail the
benchmark on RPC errors; also ensure the anonymous function passed to
RangeConcurrent returns any inner RPC error (from c.Nodes) so the outer err
reflects failures from the inner call.
- Line 142: Add the required unit-test delimiter by inserting the comment line
"// NOT IMPLEMENTED BELOW" immediately after the b.StopTimer() call in
benchmark_test.go (the block containing the b.StopTimer() statement); this will
separate implemented tests from unimplemented ones per repo guidelines.

In `@internal/net/grpc/client.go`:
- Around lines 1047-1098: When swapping in a newly created pool in the code that calls pool.New/conn.Connect, capture any existing connection via g.conns.Load(addr) before replacing it. After the new connection is connected and health-checked, call g.Disconnect(ctx, addr) for the old connection if one existed (joining any disconnect error into the returned error, or logging it), and do not increment g.clientCount in that case; only call atomic.AddUint64(&g.clientCount, 1) when there was no previous connection for that addr. Store the new connection with g.conns.Store(addr, conn) only after the successful swap so it replaces the old one without leaking the old pool.

In `@internal/net/grpc/pool/pool_bench_test.go`:
- Line 247: Add the required placeholder by inserting a single line containing
"// NOT IMPLEMENTED BELOW" on its own line in pool_bench_test.go immediately
before the file end (i.e., after the last implemented test / before the final
closing brace or trailing newline) so the test file includes the delimiter
separating implemented tests from unimplemented ones.

---

Duplicate comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around lines 43-50: The test has a TOCTOU race: it listens on "localhost:0", closes the listener, then re-binds later in startBroadcastServer. Change startBroadcastServer to accept a pre-bound net.Listener (or create and use a bufconn listener) so the server reuses the same listener instead of re-binding, or let the server bind to port 0 itself and retrieve the actual port from its own listener. Update call sites that currently pass a string port to provide the listener (or adapt to the server-returned address), and keep references to startBroadcastServer and any errgroup usage consistent.
- Around lines 110-126: The retry loop for connecting to agents spins tightly and uses the stdlib sync.WaitGroup. Replace it with Vald's internal sync utilities (e.g., internal/sync or internal/sync/errgroup) and add a short backoff between retries to avoid busy looping: in the goroutine that calls client.Connect(ctx, a) for each addr in addrs, use the internal errgroup or waitgroup equivalent instead of wg, and sleep for a few milliseconds after each failed attempt before retrying. Ensure connectedCount (atomic.AddInt64) is still incremented on success and that the new sync helper is used to wait for all goroutines to finish.

In `@internal/net/grpc/option.go`:
- Around lines 42-44: The defaults set in defaultOptions currently use WithKeepaliveParams("30s", "10s", true), which enables PermitWithoutStream and can trigger server GOAWAY when servers enforce min_time/permit_without_stream. Disable PermitWithoutStream by default by setting the third argument to false, i.e., WithKeepaliveParams("30s", "10s", false). Update any related comments and leave WithConnectionPoolSize(10) unchanged.
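For reference, assuming Vald's WithKeepaliveParams option maps onto the stock gRPC client API, the safer default corresponds roughly to this keepalive.ClientParameters configuration (a sketch of the intended settings, not Vald's actual option wiring):

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Safer keepalive default: pings are only sent while streams are active,
// so servers enforcing a keepalive policy won't answer with GOAWAY
// (ENHANCE_YOUR_CALM) on idle connections.
var opt = grpc.WithKeepaliveParams(keepalive.ClientParameters{
	Time:                30 * time.Second, // ping after 30s of inactivity
	Timeout:             10 * time.Second, // wait 10s for the ping ack
	PermitWithoutStream: false,            // do not ping idle connections
})
```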

In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around lines 211-216: Remove the development scratch comments inside BenchmarkPool_HighContention: delete the multi-line notes about the mock server, dialer, and ListenAndServe so the benchmark contains only actual setup and benchmark code. Keep the BenchmarkPool_HighContention signature and any real setup/helper calls; remove only the commented-out planning text.
- Around lines 237-246: The benchmark measures setup time and allocations because its timers are not bracketed. Wrap setup with b.StopTimer()/b.StartTimer() like the other benchmarks so only the p.Do loop is measured: call b.StopTimer() before any setup performed prior to b.RunParallel, do the setup, then call b.StartTimer() immediately before the b.RunParallel that invokes p.Do and ClientConn. Keep b.ResetTimer() where appropriate to clear prior measurements.

In `@internal/net/grpc/pool/pool.go`:
- Around lines 587-599: Remove the informal development-scratch comments in the dial logic around p.isHealthy, NewClient, and the state handling (the multi-line notes about Idle, TransientFailure, Shutdown, and "we might want to keep it"). Replace them with a short, descriptive comment (or none) that states the intent clearly, for example: "If the connection is in TransientFailure, retain it and let it retry; otherwise close it." Keep conn.GetState(), connectivity.TransientFailure, and p.isHealthy(idx, conn) functionally untouched; only clean up the explanatory comments.

@kmrmt
Contributor

kmrmt commented Feb 19, 2026

/format

@vdaas-ci
Collaborator

[FORMAT] Updating license headers and formatting go codes triggered by kmrmt.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch from b1a485a to 59b91f6 Compare February 19, 2026 21:56
@kpango
Collaborator Author

kpango commented Mar 16, 2026

@copilot Can you fix it as shown below?

Great suggestion! Adding a reproducible CI container (custom runner image) ensures your build environment is consistent and includes all required tools (like gzip/pigz), reducing platform-related errors. Here’s how you can do it for your workflow in vdaas/vald:


1. Create a Dockerfile for your CI container

Place this file under .github/ci/Dockerfile (or wherever suits your repo):

FROM ubuntu:22.04

# Install required build tools, then clean up the apt cache to keep the image small
RUN apt-get update && \
    apt-get install -y \
        build-essential \
        git \
        pigz \
        curl \
        ca-certificates \
        docker.io \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

Adapt the list based on what your build needs (e.g., other language runtimes/tools).


2. Update your GitHub Actions workflow

In .github/workflows/dockers-agent-ngt-image.yaml, use the custom container for your CI job:

jobs:
  build-agent-ngt:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/vdaas/vald/ci-agent:latest
      options: --privileged # If you need nested Docker
    steps:
      - uses: actions/checkout@v4
      - name: Prune Docker Build Cache
        run: docker builder prune --force
      - name: Build Agent NGT Image
        run: docker build --no-cache -t agent-ngt .

3. Build and push your CI container to GitHub Container Registry

  1. Build the CI image locally:

    docker build -t ghcr.io/vdaas/vald/ci-agent:latest -f .github/ci/Dockerfile .
  2. Login and push:

    docker login ghcr.io
    docker push ghcr.io/vdaas/vald/ci-agent:latest

4. Fixing the build errors

With this dedicated CI container, you ensure tools like pigz, gzip, docker, etc. are always available and up-to-date. If you need specific versions or additional tools, just add them to your Dockerfile.


Summary:
Adding a custom CI container solves issues caused by missing system utilities and inconsistent environments. Update your workflow to reference this container, and your builds will be stable, reproducible, and tool-complete.

Let me know if you want a PR to add these files and workflow updates!

Contributor

Copilot AI commented Mar 16, 2026

@kpango I've opened a new pull request, #3503, to work on those changes. Once the pull request is ready, I'll request review from you.

@CLAassistant

CLAassistant commented Mar 16, 2026

CLA assistant check
All committers have signed the CLA.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch 5 times, most recently from 6582777 to a2a25af Compare March 23, 2026 18:53
@kpango
Collaborator Author

kpango commented Mar 31, 2026

/format

@vdaas-ci
Collaborator

[FORMAT] Updating license headers and formatting go codes triggered by kpango.

Contributor

Copilot AI left a comment


Pull request overview

This PR updates CI/dependency/tooling versions and introduces several gRPC connection-pool and server shutdown changes aimed at reducing broadcast latency, improving pool behavior, and adding optional pool metrics/config knobs.

Changes:

  • Bump pinned GitHub Actions/tool versions and environment version references (Go/Rust/Docker/K8s/Helm/etc.).
  • Update gRPC client/pool behavior (pool sizing defaults, keepalive defaults, async refresh, metrics toggle) and improve server shutdown to close listeners.
  • Add/extend tests and benchmarks (gRPC pool/getHealthyConn, broadcast benchmarks, cgroup stats determinism) and introduce Rust meta config loading/CLI args.

Reviewed changes

Copilot reviewed 140 out of 148 changed files in this pull request and generated no comments.

Show a summary per file
File Description
versions/actions/SOFTPROPS_ACTION_GH_RELEASE Bump pinned GitHub Action version.
versions/actions/SHOGO82148_ACTIONS_UPLOAD_RELEASE_ASSET Bump pinned GitHub Action version.
versions/actions/GITHUB_ISSUE_METRICS Bump pinned GitHub Action version.
versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF Bump pinned GitHub Action version.
versions/actions/GITHUB_CODEQL_ACTION_INIT Bump pinned GitHub Action version.
versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD Bump pinned GitHub Action version.
versions/actions/GITHUB_CODEQL_ACTION_ANALYZE Bump pinned GitHub Action version.
versions/actions/DOCKER_SETUP_BUILDX_ACTION Bump pinned GitHub Action version.
versions/actions/DOCKER_LOGIN_ACTION Bump pinned GitHub Action version.
versions/actions/CRAZY_MAX_GHACTION_IMPORT_GPG Bump pinned GitHub Action version.
versions/actions/CODECOV_CODECOV_ACTION Bump pinned GitHub Action version.
versions/actions/AZURE_SETUP_HELM Bump pinned GitHub Action version.
versions/actions/ACTIONS_UPLOAD_ARTIFACT Bump pinned GitHub Action version.
versions/actions/ACTIONS_SETUP_GO Bump pinned GitHub Action version.
versions/actions/ACTIONS_DOWNLOAD_ARTIFACT Bump pinned GitHub Action version.
versions/actions/ACTIONS_CACHE Bump pinned GitHub Action version.
versions/ZLIB_VERSION Update bundled tool/library version pin.
versions/YQ_VERSION Update tool version pin.
versions/TELEPRESENCE_VERSION Update tool version pin.
versions/RUST_VERSION Update Rust version pin.
versions/PROTOBUF_VERSION Update tool version pin.
versions/PROMETHEUS_STACK_VERSION Update Helm chart version pin.
versions/NGT_VERSION Update NGT version pin.
versions/KUBELINTER_VERSION Update tool version pin.
versions/KUBECTL_VERSION Update tool version pin.
versions/K3S_VERSION Update tool version pin.
versions/HELM_VERSION Update tool version pin.
versions/HDF5_VERSION Update library version pin.
versions/GO_VERSION Update Go version pin.
versions/GOLANGCILINT_VERSION Update linter version pin.
versions/FAISS_VERSION Update Faiss version pin.
versions/DOCKER_VERSION Update Docker version pin.
versions/CMAKE_VERSION Update CMake version pin.
versions/CHAOS_MESH_VERSION Update Chaos Mesh version pin.
versions/BUF_VERSION Update Buf version pin.
rust/rust-toolchain.toml Bump Rust toolchain channel.
rust/libs/vqueue/Cargo.toml Bump Rust dependency versions.
rust/libs/observability/Cargo.toml Bump Rust dependency versions.
rust/libs/kvs/Cargo.toml Bump Rust dependency versions.
rust/libs/algorithms/qbg/build.rs Add ThinLTO flag to C++ build step.
rust/libs/algorithms/ngt/build.rs Add ThinLTO flag to C++ build step.
rust/bin/meta/src/main.rs Add CLI args + config loading/validation for meta binary.
rust/bin/meta/src/config.rs New YAML config + validation + observability conversion.
rust/bin/meta/Cargo.toml Add config/CLI deps and dev-deps for tests.
rust/bin/agent/Cargo.toml Bump Rust dependency versions.
rust/Cargo.toml Enable ThinLTO + codegen-units tuning for release builds.
pkg/agent/core/ngt/handler/grpc/insert.go Prefer Vector.Timestamp over Config.Timestamp for inserts.
pkg/agent/core/ngt/handler/grpc/index_test.go Update tests to account for timestamp migration.
k8s/operator/helm/crds/valdrelease.yaml Add enable_metrics schema field for connection pool.
k8s/manager/index/deployment.yaml Update config checksum annotation.
k8s/manager/index/configmap.yaml Add enable_metrics to connection_pool config.
k8s/index/job/save/configmap.yaml Add enable_metrics to connection_pool config.
k8s/index/job/deletion/configmap.yaml Add enable_metrics to connection_pool config.
k8s/index/job/creation/configmap.yaml Add enable_metrics to connection_pool config.
k8s/index/job/correction/configmap.yaml Add enable_metrics to connection_pool config.
k8s/gateway/gateway/mirror/deployment.yaml Update config checksum annotation.
k8s/gateway/gateway/mirror/configmap.yaml Add enable_metrics to connection_pool config.
k8s/gateway/gateway/lb/deployment.yaml Update config checksum annotation.
k8s/gateway/gateway/lb/configmap.yaml Add enable_metrics to connection_pool config.
internal/servers/server/server.go Track listener and close it during Shutdown.
internal/net/grpc/stats/stats_test.go Make cgroup tests deterministic via temp dirs; avoid parallel.
internal/net/grpc/stats/stats.go Add proc path override + better cgroup v2 path detection.
internal/net/grpc/server_test.go Adjust comparison options for unexported fields.
internal/net/grpc/pool/pool_bench_test.go Add high-contention pool benchmark.
internal/net/grpc/pool/pool_bench_gethealthy_test.go Add getHealthyConn benchmarks (serial + parallel).
internal/net/grpc/pool/pool.go Rework pool selection/refresh/health + add metrics toggle.
internal/net/grpc/pool/option.go Add pool option to enable/disable metrics.
internal/net/grpc/option.go Change default pool size and add default keepalive params + metrics option.
internal/net/grpc/errdetails/errdetails_test.go Add comprehensive tests for detail decoding/serialization.
internal/net/grpc/client.go Add singleflight around connect/disconnect + pool metrics wiring.
internal/net/grpc/benchmark_test.go New benchmarks for broadcast and RPC execution paths.
internal/core/algorithm/ngt/doc.go Update NGT upstream link.
internal/core/algorithm/faiss/Capi.cpp Add missing include (sstream).
internal/config/grpc_test.go Update expected options due to new connection-pool metrics option.
internal/config/grpc.go Add connection_pool.enable_metrics config and wire to grpc options.
hack/helm/schema/crd/main.go Optimize indent function string building.
hack/go.mod.default Update Go directive version.
hack/docker/gen/main.go Add clang+lld LTO path and adjust deps/packages.
hack/benchmark/internal/core/algorithm/ngt/ngt.go Update NGT search call signature.
example/client/go.sum Update dependency checksums.
example/client/go.mod.default Update Go directive version.
example/client/go.mod Update Go directive + dependency versions.
docs/user-guides/configuration.md Update NGT upstream links.
docs/usecase/usage-example.md Update NGT upstream links.
docs/tutorial/vald-agent-standalone-on-k8s.md Update NGT upstream links.
docs/performance/tuning-search-performance.md Update NGT upstream links in tables.
docs/overview/component/agent.md Update NGT upstream links.
docs/overview/component/README.md Update NGT upstream links.
docs/overview/architecture.md Update NGT upstream links.
docs/overview/about-vald.md Update NGT upstream links.
dockers/dev/Dockerfile Swap pkg-config->pkgconf; add pigz.
dockers/agent/core/agent/Dockerfile Add clang+lld, RUSTFLAGS, and ThinLTO build path.
cmd/tools/cli/benchmark/core/doc.go Update NGT upstream link.
charts/vald/values.yaml Add enable_metrics field to connection_pool defaults.
charts/vald/values.schema.json Add schema for enable_metrics and update NGT links.
charts/vald-with-tikv/values.yaml Update NGT links and adjust filename_suffix quoting.
charts/vald-helm-operator/crds/valdrelease.yaml Add enable_metrics schema field for connection pool.
README.md Update NGT upstream link.
Makefile.d/tools.mk Fix rustfmt install dependency ordering.
Makefile.d/docker.mk Build docker images with --network host.
Makefile.d/dependencies.mk Update NGT release lookup repo URL.
Makefile Update NGT repo and add NGT_EXTRA_CMAKE_FLAGS; run go fix in format target.
.gitignore Ignore generated /gen directory.
.github/workflows/update-pull-request-and-issue-template.yaml Bump action versions.
.github/workflows/update-deps.yaml Bump action versions.
.github/workflows/reviewdog-markdown.yaml Exclude an additional README from reviewdog patterns.
.github/workflows/release.yaml Bump action versions.
.github/workflows/issue-metrics.yaml Bump action versions.
.github/workflows/helm.yaml Bump action versions.
.github/workflows/github-actions-workflow-linter.yaml Bump setup-go action version.
.github/workflows/format.yaml Bump action versions.
.github/workflows/e2e-profiling.yaml Bump artifact/cache actions.
.github/workflows/e2e-code-bench-agent.yaml Bump artifact action version.
.github/workflows/dockers-readreplica-rotate-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-manager-index-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-save-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-operator-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-exportation-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-deletion-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-creation-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-correction-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-gateway-mirror-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-gateway-lb-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-gateway-filter-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-benchmark-operator-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-benchmark-job-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-agent-sidecar-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-agent-faiss-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/coverage.yaml Bump codecov action version.
.github/workflows/codeql-analysis.yml Bump codeql actions.
.github/workflows/chatops.yaml Bump gpg import action version.
.github/workflows/build-binaries.yaml Bump artifact/download actions and upload-release-asset action.
.github/workflows/backport.yaml Bump gpg import action version.
.github/workflows/_release-pr.yaml Bump gpg import action version.
.github/workflows/_docker-image.yaml Bump docker login & buildx actions.
.github/actions/setup-helm/action.yaml Bump azure/setup-helm action version.
.github/actions/setup-go/action.yaml Bump actions/setup-go action version.
.github/actions/scan-docker-image/action.yaml Bump codeql upload-sarif action version.
.github/actions/e2e-profile/action.yaml Bump artifact/cache actions.
.github/PULL_REQUEST_TEMPLATE.md Update environment version hints.
.github/ISSUE_TEMPLATE/security_issue_report.md Update environment version hints.
.github/ISSUE_TEMPLATE/bug_report.md Update environment version hints.
.gitfiles Add new benchmark/config paths to file list.
Comments suppressed due to low confidence (9)

  • internal/servers/server/server.go: Shutdown collects listener close errors into rerr but always returns nil, which drops real shutdown failures. Return rerr (keeping it nil when there are no errors) so callers can detect shutdown problems.
  • internal/net/grpc/pool/pool.go: refreshConn now runs dials asynchronously and explicitly suppresses dial errors (return nil // do not propagate), which can make Connect() succeed even when all connections failed to dial. Consider tracking dial failures (e.g., an atomic failure counter or first-error capture) and have connect()/singleTargetConnect() return an error when no slot becomes usable within the connect call's context.
  • internal/net/grpc/pool/pool.go: TransientFailure is treated as healthy, which can cause IsHealthy() (and callers relying on it) to report a pool as healthy even when RPCs are likely to fail or be backed off. If you want to keep the conn for retry/backoff, keep it open, but consider reporting TransientFailure as unhealthy (or splitting 'usable' vs 'healthy' semantics).
  • internal/net/grpc/pool/pool.go: This loop iterates up to p.poolSize.Load() but indexes into slots with special-casing when idx >= slen (falling back to the last slot). If poolSize can ever exceed len(slots) during resize/update windows, this will repeatedly operate on the last slot and skew health/refresh behavior. A safer approach: compute limit := min(p.poolSize.Load(), uint64(len(slots))) (or ensure the slots length is always equal to the pool size) and iterate only limit elements.
  • internal/net/grpc/pool/pool_bench_gethealthy_test.go: These benchmarks create and connect a pool but never disconnect/close it. Add a cleanup (e.g., defer p.Disconnect(ctx) or equivalent) so repeated benchmark runs don't leak goroutines/connections and skew results.
  • internal/net/grpc/pool/pool_bench_test.go: BenchmarkPool_HighContention connects a pool but does not disconnect it. Please add cleanup to close connections after the benchmark to avoid leaking resources and contaminating subsequent benchmarks.
  • Makefile.d/docker.mk: Adding --network host to all local Docker builds increases risk (build steps can access host network services) and can reduce build reproducibility. Consider gating this behind an opt-in flag (e.g., DOCKER_BUILD_NETWORK ?= default, adding --network host only when explicitly requested).
  • hack/helm/schema/crd/main.go: This returns result.String() twice and slices using a length computed from a separate result.String() call. It's inefficient and can panic when the accumulated string is empty (len == 0). Store the string once in a local variable and only slice when the length > 0 (or avoid slicing by writing conditionally).
  • internal/net/grpc/pool/pool.go: The PR description emphasizes dependency/workflow updates, but this PR also includes substantial behavioral changes to gRPC pooling/client connection logic (async refresh, health semantics, defaults, metrics). Consider updating the PR description to explicitly call out these runtime behavior changes (and any rollout/compatibility notes) so reviewers/operators don't miss them.


@kpango
Collaborator Author

kpango commented Mar 31, 2026

/format

@vdaas-ci
Collaborator

[FORMAT] Updating license headers and formatting go codes triggered by kpango.

@kpango
Collaborator Author

kpango commented Mar 31, 2026

/format

@vdaas-ci
Collaborator

[FORMAT] Updating license headers and formatting go codes triggered by kpango.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 143 out of 166 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (11)

  • internal/servers/server/server.go: Shutdown accumulates errors into rerr but always returns nil. This hides listener-close failures (and any joined errors). Return rerr instead of nil so callers can observe shutdown errors.
  • internal/net/grpc/pool/pool.go: refreshConn swallows dial errors by returning nil from the goroutine even when p.dial fails and there is no healthy existing connection. This can cause Connect() to report success while the pool has no usable connections. Consider propagating dial errors (or at least returning an error when all refreshes fail or no healthy conns exist) so callers can react appropriately.
  • internal/net/grpc/pool/pool.go: refreshConn swallows dial errors by returning nil from the goroutine even when p.dial fails and there is no healthy existing connection. This can cause Connect() to report success while the pool has no usable connections. Consider propagating dial errors (or at least returning an error when all refreshes fail or no healthy conns exist) so callers can react appropriately.
  • internal/net/grpc/pool/pool.go: TransientFailure is treated as healthy here. In gRPC, TransientFailure generally indicates the connection is not ready for RPCs and will often immediately fail calls, which can increase observed latency. If the intent is to keep a failing conn around for backoff retries, consider distinguishing 'keep for retry' from 'healthy for use': return false for TransientFailure so Get/IsHealthy can avoid selecting it while still optionally triggering refresh/reconnect logic.
  • internal/net/grpc/stats/stats.go: subPath parsed from /proc/self/cgroup typically starts with / (e.g. /system.slice/...). If file.Join follows filepath.Join semantics, joining cgroupBasePath with an absolute subPath will discard cgroupBasePath and produce the wrong candidate path. Trim the leading slash (e.g. strings.TrimPrefix(subPath, "/")) before joining so the candidate resolves under cgroupBasePath.
  • hack/helm/schema/crd/main.go: This implementation repeatedly calls indent.String() and also calls result.String() twice on the return line, which creates avoidable allocations. Also, result.String()[:len(...)-1] is fragile and harder to read. Prefer computing indentStr := indent.String() once, and compute out := result.String() once before slicing/returning.
  • hack/helm/schema/crd/main.go: This implementation repeatedly calls indent.String() and also calls result.String() twice on the return line, which creates avoidable allocations. Also, result.String()[:len(...)-1] is fragile and harder to read. Prefer computing indentStr := indent.String() once, and compute out := result.String() once before slicing/returning.
  • hack/docker/gen/main.go: The RUSTFLAGS value here includes literal quotes ("..."). When exported into the container environment, those quotes may become part of the variable value and break rustc flag parsing. Store the raw flags without wrapping quotes so the environment variable is exactly -Clinker=clang -Clink-arg=-fuse-ld=lld.
  • Makefile.d/docker.mk: Adding --network host reduces build isolation (the Dockerfile build steps can access host services and networks) and can behave differently across environments/runtimes. Consider making this conditional (e.g., behind a Make variable defaulting to off) so users/CI can opt in only when necessary.
  • Makefile: go fix can perform code rewrites (not just formatting) and may introduce unexpected diffs when developers run make format/go. Consider moving go fix to a separate explicit target (or gating it behind a variable) to keep formatting deterministic and narrowly scoped.
  • Makefile: go fix can perform code rewrites (not just formatting) and may introduce unexpected diffs when developers run make format/go. Consider moving go fix to a separate explicit target (or gating it behind a variable) to keep formatting deterministic and narrowly scoped.


@kpango
Collaborator Author

kpango commented Apr 2, 2026

/format

@vdaas-ci
Collaborator

vdaas-ci commented Apr 2, 2026

[FORMAT] Updating license headers and formatting go codes triggered by kpango.

Signed-off-by: kpango <kpango@vdaas.org>
| | description | how does it affect? | Memo |
| :--------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------ |
| agent.ngt.creation_edge_size | Number of nodes connected to one node<BR>[ref: no_of_edges](https://github.com/NGT-labs/NGT/tree/main/bin/ngt#create) | It helps reduce unreachable edges.<BR>The larger it is, the denser the graph structure will be, but the memory usage, search speed, and index construction time will increase accordingly. | default value: `20` |
| agent.ngt.search_edge_size | Number of nodes to search from the origin node when searching<BR>[ref: no_of_edges_at_search_at](https://github.com/NGT-labs/NGT/tree/main/bin/ngt#create) | The number of nodes to search will increase.<BR>Accuracy will be higher, but speed will be lower.<BR>Adjust if adjusting the radius and epsilon does not improve the situation. | default value: `10` |
Contributor


[LanguageTool] reported by reviewdog 🐶
Did you mean “lowed”, “lowered”? (MUST_BE_DO[2])
Suggestions: lowed, lowered
Rule: https://community.languagetool.org/rule/show/MUST_BE_DO?lang=en-US&subId=2
Category: GRAMMAR

