
Fix gRPC broadcast latency and tune connection pool#3478

Closed
kpango wants to merge 1 commit into main from
fix-grpc-broadcast-latency-17105619334351837761

Conversation

@kpango
Collaborator

@kpango kpango commented Feb 18, 2026

This pull request primarily updates dependencies and environment versions used in CI/CD workflows, issue templates, and adds new benchmark and configuration files to the codebase. The changes are focused on keeping the project up-to-date with the latest tools, improving test coverage, and ensuring accurate environment information.

Dependency and workflow updates:

  • Upgraded several GitHub Actions to their latest versions, including actions/upload-artifact, actions/download-artifact, shogo82148/actions-upload-release-asset, crazy-max/ghaction-import-gpg, and github/codeql-action across multiple workflow files for improved reliability and security.

Environment version updates:

  • Updated Go, Rust, Docker, Kubernetes, Helm, NGT, and Faiss versions in issue templates and the pull request template to reflect the latest releases and ensure accurate environment reporting.

Test and benchmark coverage improvements:

  • Added new benchmark test files: internal/net/grpc/benchmark_test.go and internal/net/grpc/pool/pool_bench_gethealthy_test.go to enhance performance testing of gRPC components.

File and configuration additions:

  • Introduced rust/bin/meta/src/config.rs to the Rust meta binary, likely for improved configuration management.

Build trigger enhancements:

  • Added internal/sync/singleflight/*.go to multiple workflow path filters so CI runs when the singleflight package changes.

@google-labs-jules

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Feb 18, 2026

Deploying vald with Cloudflare Pages

Latest commit: 326a4a2
Status: ✅  Deploy successful!
Preview URL: https://8c7b037b.vald.pages.dev
Branch Preview URL: https://fix-grpc-broadcast-latency-1.vald.pages.dev


@vdaas-ci
Collaborator

[CHATOPS:HELP] ChatOps commands.

  • 🙆‍♀️ /approve - approve
  • 🍱 /format - format codes and add licenses
  • /gen-test - generate test codes
  • 🏷️ /label - add labels
  • /label actions/e2e-deploy - run E2E deploy & integration test

@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds per-address singleflight synchronization to the gRPC client, refactors connection-pool health/refresh into async flows using gRPC connectivity states with optional metrics, increases default pool sizing and keepalive, updates gateway broadcast concurrency, adds benchmarks, expands CI triggers for singleflight, and bumps several Rust deps.

Changes

  • gRPC client core — internal/net/grpc/client.go: Adds per-address singleflight.Group and an enablePoolMetrics flag; serializes Connect/Disconnect via group.Do; refactors connect/reconnect, pool creation, and RangeConcurrent behavior.
  • Connection pool — internal/net/grpc/pool/pool.go, internal/net/grpc/pool/option.go: Adds an enableMetrics option; changes health checks to use gRPC connectivity states; introduces async refreshConn via errgroup and altered dial/refresh semantics.
  • gRPC options & config — internal/net/grpc/option.go, internal/config/grpc.go, charts/vald/values.yaml: Increases the default pool size, adds keepalive params, and exposes WithEnableConnectionPoolMetrics and the ConnectionPool.EnableMetrics config flag.
  • Benchmarks & tests — internal/net/grpc/benchmark_test.go, internal/net/grpc/pool/pool_bench_test.go: Adds a bufconn-based broadcast benchmark and a high-contention pool benchmark.
  • Gateway behavior — pkg/gateway/lb/service/gateway.go: Broadcast now uses dynamic concurrency via GetAgentCount(fctx) instead of a fixed -1.
  • CI workflows — .github/workflows/...: Adds internal/sync/singleflight/*.go to multiple workflow path filters so CI triggers on singleflight package changes.
  • Config & charts — charts/vald/values.yaml: Adds defaults.grpc.client.connection_pool.enable_metrics, a boolean defaulting to false.
  • Rust deps — rust/libs/*/Cargo.toml: Bumps several Rust dependency versions (e.g., tonic, futures-core, wincode).
  • New tests file — internal/net/grpc/benchmark_test.go: Introduces a bufconn-based mock broadcast server and BenchmarkBroadcast.

Sequence Diagram(s)

sequenceDiagram
    participant Client as gRPC Client
    participant SF as singleflight (per-addr)
    participant Pool as Connection Pool
    participant Server as Remote gRPC Server

    Client->>SF: Connect(addr) [group.Do]
    SF->>Pool: ensure/create/upgrade pool
    Pool->>Pool: inspect connectivity state (Ready/Idle/Connecting/TransientFailure/Shutdown)
    alt Idle
        Pool->>Pool: trigger async refresh/dial (errgroup)
        Pool-->>SF: supply connection when ready
    else Ready or Connecting
        Pool-->>SF: return connection
    else TransientFailure
        Pool-->>SF: return last-resort conn and schedule refresh
    else Shutdown
        Pool->>Pool: async dial new conn
    end
    SF-->>Client: connection returned
    Client->>Server: RPCs via chosen connection
    Client->>SF: Disconnect(addr) [group.Do]
    SF->>Pool: serialized teardown/replace
    Pool->>Pool: close/replace slots

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

type/feature, type/dependency

Suggested reviewers

  • kmrmt
  • Matts966
  • datelier

Poem

Singleflight queues one call in line,
Pools wake and sleep by connectivity sign.
Bufconn hums benchmarks in-memory bright,
CI learns to watch the singleflight.
Metrics sleep off — now ready to shine.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title Check — ✅ Passed: the title directly captures the two main objectives: fixing gRPC broadcast latency and tuning the connection pool configuration.
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-grpc-broadcast-latency-17105619334351837761

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov bot commented Feb 18, 2026

Codecov Report

❌ Patch coverage is 15.29052% with 277 lines in your changes missing coverage. Please review.
✅ Project coverage is 26.31%. Comparing base (70de684) to head (1fade9f).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
internal/net/grpc/pool/pool.go 0.00% 137 Missing ⚠️
internal/net/grpc/client.go 0.00% 122 Missing ⚠️
hack/helm/schema/crd/main.go 0.00% 6 Missing ⚠️
internal/servers/server/server.go 50.00% 4 Missing and 1 partial ⚠️
internal/net/grpc/option.go 0.00% 3 Missing ⚠️
internal/net/grpc/pool/option.go 33.33% 1 Missing and 1 partial ⚠️
hack/benchmark/internal/core/algorithm/ngt/ngt.go 0.00% 1 Missing ⚠️
hack/license/gen/main.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3478      +/-   ##
==========================================
+ Coverage   25.02%   26.31%   +1.28%     
==========================================
  Files         544      571      +27     
  Lines       43800    50472    +6672     
==========================================
+ Hits        10963    13283    +2320     
- Misses      32108    36209    +4101     
- Partials      729      980     +251     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around line 49-86: Move the setup of the gRPCClient, population of
client.conns (mockPool entries) and atomic.StoreUint64(&client.clientCount, ...)
so it runs before starting the benchmark timers; then call b.ReportAllocs() and
b.ResetTimer() once before the sub-benchmarks in BenchmarkBroadcast. Inside each
sub-benchmark replace the old for i := 0; i < b.N; i++ loops with the Go 1.24
style for b.Loop() loop, and keep the call to client.RangeConcurrent(...) and
its simulated latency inside that loop. Ensure you reference the
BenchmarkBroadcast function, the gRPCClient instance (client),
client.conns/mockPool population, and the RangeConcurrent call when making these
changes.
- Around line 36-47: The mockPool currently hands out zero-value grpc.ClientConn
structs (see mockPool.Do, mockPool.Get, mockPool.Reconnect) which violates gRPC
construction rules; replace those returns with real, properly-initialized
*grpc.ClientConn instances created by dialing a test server (use net.Listen +
grpc.NewServer or an in-process bufconn as done in pool_bench_test.go), update
the benchmark setup to start and tear down the test gRPC server and dial a
client connection once, and have mockPool methods return that *grpc.ClientConn
instead of &grpc.ClientConn{} so any later method calls on the connection are
safe.

In `@internal/net/grpc/option.go`:
- Line 43: The change increasing the connection pool to
WithConnectionPoolSize(10) can multiply persistent gRPC connections (e.g., 100
agents → 1,000 connections) and risk hitting OS file descriptor and memory
limits; update the configuration and docs to make the pool size safe and
configurable, validate/adjust WithIdleTimeout to reclaim idle connections, and
add a checklist/test to verify target nodes' ulimit -n and pod memory limits for
the expected scale. Specifically, ensure WithConnectionPoolSize is
documented/configurable (or reduced), tune WithIdleTimeout to a shorter reclaim
window, add runtime validation or warnings if computed total FDs may exceed
ulimit, and include guidance in deployment manifests about required pod memory
and ulimit values.
- Around line 43-44: The hardcoded client keepalive in WithKeepaliveParams
currently sets PermitWithoutStream=true which will cause servers (with default
permit_without_stream=false and min_time=10m) to GOAWAY; update the
WithKeepaliveParams usage in internal/net/grpc/option.go to use
PermitWithoutStream=false (or remove/disable the aggressive 30s keepalive) so
client pings won’t be sent without active streams, or alternatively coordinate
server-side config changes before landing; change the WithKeepaliveParams call
accordingly to prevent immediate GOAWAYs from Vald servers.
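A hedged sketch of the client parameters the comments above call for, aligned with a server enforcing min_time: 10m and permit_without_stream: false. This is configuration guidance, not the PR's actual diff; field names are from google.golang.org/grpc/keepalive.ClientParameters.

```go
// Keepalive no more aggressive than the server permits: pings are sent
// at most every 10 minutes and never without an active stream, so the
// server's keepalive enforcement policy won't answer with GOAWAY.
kp := keepalive.ClientParameters{
	Time:                10 * time.Minute, // >= the server's min_time
	Timeout:             30 * time.Second, // wait this long for a ping ack
	PermitWithoutStream: false,            // no pings while no RPCs are active
}
conn, err := grpc.NewClient(addr, grpc.WithKeepaliveParams(kp))
```

If the 30s interval is genuinely needed, the server-side enforcement policy has to be relaxed in the same change set, not after the client lands.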

In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around line 237-246: The benchmark currently runs measurement without
bracketing timers or reporting allocations; update the benchmark surrounding the
b.RunParallel(...) loop to match the other benchmarks by calling b.StopTimer(),
b.ReportAllocs(), b.ResetTimer(), and b.StartTimer() before entering the
b.RunParallel block (so pool setup and connection time are excluded) and call
b.StopTimer() immediately after the parallel loop; modify the same benchmark
that invokes p.Do(...) on *ClientConn to include these calls.
- Around line 212-216: Remove the stale development-commentary block that spans
the scratch notes referencing "pool_bench_test.go" and the
ListenAndServe/pool.New ideas (the comment lines between the existing test setup
and helper usage); simply delete those lines so the test file contains only
meaningful comments and code around the test setup (where ListenAndServe and
pool.New are used) to keep the test tidy.

In `@pkg/gateway/lb/service/gateway.go`:
- Line 91: The tracing span was started on fctx but GetAgentCount was called
with the stale ctx; change the call to use the traced context (call
g.GetAgentCount(fctx)) so the address lookup is attributed to the BroadCast
trace span, and ensure any other calls inside the same return expression use
fctx consistently (e.g., client.RangeConcurrent(fctx, g.GetAgentCount(fctx),
...)).

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/net/grpc/pool/pool.go (2)

348-350: 🧹 Nitpick | 🔵 Trivial

Dead code: count == 0 is unreachable after the sz == 0 guard

The early-return at lines 328–330 guarantees sz > 0 before the loop. Since every iteration either increments count (in the default branch) or returns early via ctx.Done(), count is always ≥ 1 when execution reaches line 348. The block can never trigger.

🧹 Proposed cleanup
  }
- if count == 0 {
- 	return errors.ErrGRPCPoolConnectionNotFound
- }
  return nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/net/grpc/pool/pool.go` around lines 348 - 350, Remove the
unreachable dead branch that checks `if count == 0 { return
errors.ErrGRPCPoolConnectionNotFound }` in the function that iterates over the
pool (the loop guarded by `sz == 0` and using `count` and `ctx.Done()`). Because
`sz > 0` is ensured before the loop and every loop path either increments
`count` or returns on `ctx.Done()`, delete this `count == 0` check to clean up
dead code and leave the remaining return paths intact.

287-300: ⚠️ Potential issue | 🟡 Minor

Remove load() method — confirmed unused and contains bounds-checking bug

The method is dead code: no active calls exist in the codebase, and its test is commented out. More critically, the bounds check at line 294 is inverted: if sz < idx should be if idx >= sz. If that condition evaluates true, line 295 executes slots[sz].Load() where sz == len(slots), causing an out-of-bounds panic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/net/grpc/pool/pool.go` around lines 287 - 300, Remove the dead and
unsafe method pool.load: it is unused and contains an inverted bounds check that
can index slots[len(slots)] causing an OOB panic; delete the load(idx uint64)
(ridx uint64, pc *poolConn) implementation from the pool type and any commented
tests referencing it, and run tests to confirm nothing else depends on the
function name load in type pool (search for method name "load" and type "pool"
to ensure no remaining callers).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around line 211-213: Replace the b.Fatal(err) call after p.Connect(ctx) with
b.Error(err) so the benchmark follows the file-wide convention used in
Benchmark_ConnPool, BenchmarkParallel_ConnPool and ListenAndServe; keep the
p.Connect(ctx) error path but use b.Error(err) (and ensure any subsequent code
remains safe if p is nil).

---

Outside diff comments:
In `@internal/net/grpc/pool/pool.go`:
- Around line 348-350: Remove the unreachable dead branch that checks `if count
== 0 { return errors.ErrGRPCPoolConnectionNotFound }` in the function that
iterates over the pool (the loop guarded by `sz == 0` and using `count` and
`ctx.Done()`). Because `sz > 0` is ensured before the loop and every loop path
either increments `count` or returns on `ctx.Done()`, delete this `count == 0`
check to clean up dead code and leave the remaining return paths intact.
- Around line 287-300: Remove the dead and unsafe method pool.load: it is unused
and contains an inverted bounds check that can index slots[len(slots)] causing
an OOB panic; delete the load(idx uint64) (ridx uint64, pc *poolConn)
implementation from the pool type and any commented tests referencing it, and
run tests to confirm nothing else depends on the function name load in type pool
(search for method name "load" and type "pool" to ensure no remaining callers).

---

Duplicate comments:
In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around line 216-226: The benchmark BenchmarkPool_HighContention is missing the
standard measurement bracketing and is including setup latency and allocations;
before entering the parallel loop around p.Do(...) call stop the timer and call
b.ReportAllocs() then call b.StartTimer(), and after the parallel RunParallel
returns call b.StopTimer(); ensure you place these calls in the same pattern
used by the other benchmarks in this file so the allocation and setup overhead
are excluded from the measured loop.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/e2e_bench_test.go`:
- Around line 144-146: The code currently logs a non-fatal warning when
connectivity is partial (the time.Since(start) > 5*time.Second branch) which
allows the benchmark to continue with fewer than numAgents; change this to fail
the benchmark immediately by replacing the warning log with a fatal failure (use
b.Fatalf or equivalent) so the benchmark halts when connected < numAgents after
the timeout; update the branch that references start, connected, and numAgents
in e2e_bench_test.go to call b.Fatalf with a clear message including connected
and numAgents.
- Around line 88-101: startBenchServer currently calls b.Fatalf on error which
panics when invoked from goroutines; change startBenchServer to return (addr
string, td func(), err error) instead of calling b.Fatalf, update the goroutine
to capture the returned err into a shared errors slice (protected by setupMu)
along with addrs and teardowns, and after setupWg.Wait() check the collected
errors and call b.Fatalf (or b.Fatalf with combined error info) from the
benchmark goroutine if any error occurred; update any callers of
startBenchServer to handle the new error return.
- Line 178: The test currently calls client.Close(ctx) at the end (client.Close)
which leaks connections if a sub-benchmark panics; after creating the client
(the variable named client in e2e_bench_test.go) immediately call defer
client.Close(ctx) so the connection is always cleaned up on panic or early
return, and remove the standalone client.Close(ctx) call at line 178.
- Around line 59-72: The error channel ech created for srv.ListenAndServe is
never read so startup failures are ignored; change the logic to drain/check ech
to detect bind/startup errors by launching a background goroutine that reads
from ech and records any error (or use a non-blocking select/read after the
sleep) and fail the test/benchmark if an error is received; also replace the
fragile time.Sleep with a readiness check or by waiting for either a successful
"started" signal or an error from ech (use the existing ctx/cancel and wg to
coordinate shutdown).
- Around line 83-84: The benchmark silently swallows port conflicts because it
uses hardcoded ports and discards srv.ListenAndServe() errors; fix by adding a
WithListener(server.New option) to accept a pre-bound net.Listener and a method
like BoundAddr()/Addr() on the server to retrieve the actual bound address, then
in the test do l, _ := net.Listen("tcp", ":0"), pass l via WithListener to
server.New, obtain the bound address via BoundAddr()/Addr() and use that for
agents instead of fixed ports; additionally capture srv.ListenAndServe() errors
by sending them on a startup error channel and fail fast (or log and abort) if
no successful start within the poll timeout instead of ignoring the error, or
alternatively use IsConnected() polling against the server address to ensure all
agents are reachable before proceeding.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch 2 times, most recently from 7e7dd4a to 6fd320d on February 19, 2026 at 10:02
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around line 66-77: startBroadcastServer currently launches srv.ListenAndServe
in a goroutine and returns immediately, causing the benchmark to busy-spin
retrying connections; change startBroadcastServer to provide a readiness signal
and have the caller wait on it (e.g., reuse the existing ech channel: have
ListenAndServe send a ready/started value or close a dedicated ready channel
once listening) or use a sync.WaitGroup to block until the server is accepting
connections before returning; alternatively, add a small sleep/backoff in the
retry loop that uses addr to avoid tight CPU spinning—update references to
ListenAndServe, ech, startBroadcastServer, and the caller's retry loop to use
the chosen readiness mechanism.
- Around line 44-77: The code has a TOCTOU race: you call net.Listen to get a
free port (l), close it, then create a server that re-binds later (risking
collisions). Modify startBroadcastServer to avoid closing the pre-bound
listener: either (1) change server.New / server.ListenAndServe to accept a
net.Listener (pass l through a new option like server.WithListener or a
constructor overload) and register gRPC on that listener inside
WithGRPCRegisterar, or (2) use an in-memory bufconn listener for the test
(bufconn.Listen) and pass that into the server, or if server supports it use
server.WithHost("localhost") + server.WithPort(0) so the server itself binds an
ephemeral port; if none are possible add a short retry loop when binding. Update
code references: net.Listen (l), server.New, server.WithGRPCRegisterar,
srv.ListenAndServe, and srv.Shutdown to operate on the pre-bound listener
instead of closing l then rebinding.
- Around line 114-127: The retry loop that calls client.Connect in the goroutine
spins without any backoff and uses the standard sync.WaitGroup; change the
goroutine in the benchmark to (1) replace the stdlib WaitGroup with Vald's
internal/sync WaitGroup type (from package "internal/sync") used throughout the
repo, and (2) add a small backoff/sleep between Connect attempts inside the for
i := 0; i < 10; i++ loop (e.g., time.Sleep with a short duration or an
exponential/backoff strategy) so Connect is not hammered; keep the
atomic.AddInt64(&connectedCount, 1) and return behavior unchanged and ensure the
new WaitGroup variable (wg) is the internal/sync version so wg.Done() and
wg.Wait() continue to work.

In `@internal/net/grpc/pool/pool.go`:
- Around line 587-599: Remove the development scratch comments inside the dial
logic: delete the stray internal reasoning lines between the healthy check and
the state handling (the comments mentioning "But NewClient returns Idle
usually..." and "isHealthy returns true for Idle..."). Keep the conditional
behavior intact (use p.isHealthy(idx, conn), inspect conn.GetState() and the
connectivity.TransientFailure branch) but tidy the block so only concise,
production-ready comments remain or no comment at all.
- Around line 655-658: The Debugf call in the connectivity.TransientFailure case
of pool.go has four format verbs but only three arguments (pc.addr, idx+1,
state.String()); add the missing pool-size argument to match the third %d (e.g.,
use the pool size field or computed length such as pc.poolSize or len(pc.conns))
and ensure arguments are ordered as (pc.addr, idx+1, <pool-size>,
state.String()) so the format string and arguments align.
- Around line 665-681: The second-pass loop currently calls p.refreshConn
synchronously (which calls p.dial / grpc.NewClient) and can block request
latency; change the logic in the loop that iterates over slots (where load, pc,
pc.conn and connectivity states are inspected) to not synchronously call
refreshConn but instead schedule the refresh asynchronously via
p.errGroup.Go(func(){ _ = p.refreshConn(ctx, idx, pc, p.addr) }) and immediately
continue/return an error to the caller when no usable connection was found on
the first pass; keep the existing behavior of returning a TransientFailure pc as
a last resort, but ensure refresh attempts are backgrounded on p.errGroup and
the caller is not blocked on network dial timeouts.
- Around line 897-901: The review notes that treating
connectivity.TransientFailure as "healthy" is intentional but non-obvious;
update the code to document this design decision: add a clear comment near the
IsHealthy implementation (and the getHealthyConn function) explaining that
TransientFailure is considered healthy to allow gRPC's exponential backoff and
internal retries to recover, that getHealthyConn performs a two-pass selection
(preferring Ready/Idle/Connecting and only returning TransientFailure as a last
resort), and that Reconnect/Connect logic intentionally skips hard reconnects
for TransientFailure—reference the IsHealthy, getHealthyConn, Reconnect, and
Connect symbols so future readers understand the semantic shift from
"immediately usable" to "not permanently dead."

---

Duplicate comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around line 132-141: The benchmark currently swallows errors and uses a manual
for-loop; update the loop to use b.Loop() (as in pool_bench_test.go) and surface
failures instead of ignoring them: inside the RangeConcurrent call handler (and
after it) replace the commented-out b.Fatalf with either b.Fatalf to fail loudly
on error or b.Error to record the failure while continuing, and ensure the outer
loop uses for b.Loop() rather than for i := 0; i < b.N; i++ so the benchmark
reports meaningful results; reference RangeConcurrent, b.Fatalf/b.Error, and
b.Loop when making the changes.

In `@internal/net/grpc/option.go`:
- Around line 43-44: The client keepalive settings used with WithKeepaliveParams
(ping interval "30s" and PermitWithoutStream true) violate the Vald server
enforcement (min_time: 10m, permit_without_stream: false) causing GOAWAYs;
update the grpc client options used in option.go (e.g., the call site
constructing WithKeepaliveParams and WithConnectionPoolSize) to set
PermitWithoutStream to false and increase the ping interval to at least the
server min_time (e.g., >=10m), or alternatively coordinate and document that
server enforcement vars (min_time, permit_without_stream) will be changed across
Vald before landing this PR.

In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around line 211-246: Remove the stale scratch comments at the start of
BenchmarkPool_HighContention (they add no value), and add benchmark hygiene:
call b.ReportAllocs() before starting measurements, wrap setup in b.StopTimer()
and b.StartTimer() so pool creation/Connect (New, Connect, ListenAndServe) are
excluded from timing, then ensure you call b.StopTimer() after the b.RunParallel
measurement completes; keep the test logic the same.

In `@pkg/gateway/lb/service/gateway.go`:
- Line 91: Call GetAgentCount using the tracing context fctx (not the original
ctx) so the span started on fctx is preserved; change the invocation in
gateway.go from g.GetAgentCount(ctx) to g.GetAgentCount(fctx) so RangeConcurrent
and GetAgentCount share the same trace context and maintain consistent trace
attribution.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch 3 times, most recently from 4a0c245 to 9a7f31a on February 19, 2026 at 10:17
@kpango
Collaborator Author

kpango commented Feb 19, 2026

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

✅ Actions performed

Full review triggered.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch from 9a7f31a to c1f0a9f on February 19, 2026 at 10:22
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around line 132-140: The benchmark currently ignores errors returned by
client.RangeConcurrent which can hide RPC failures; update the benchmark loop in
benchmark_test.go to check the err from client.RangeConcurrent and call b.Fatalf
(or b.Error/b.FailNow) with a descriptive message including err to fail the
benchmark on RPC errors; also ensure the anonymous function passed to
RangeConcurrent returns any inner RPC error (from c.Nodes) so the outer err
reflects failures from the inner call.
- Line 142: Add the required unit-test delimiter by inserting the comment line
"// NOT IMPLEMENTED BELOW" immediately after the b.StopTimer() call in
benchmark_test.go (the block containing the b.StopTimer() statement); this will
separate implemented tests from unimplemented ones per repo guidelines.

In `@internal/net/grpc/client.go`:
- Around lines 1047-1098: When swapping in a newly created pool in the code that calls pool.New/conn.Connect, capture any existing connection via g.conns.Load(addr) before replacing it. After the new connection is connected and health-checked, call g.Disconnect(ctx, addr) for the old connection if one existed (joining any disconnect error into the returned error, or logging it), and do not increment g.clientCount in that case; only call atomic.AddUint64(&g.clientCount, 1) when there was no previous connection for that addr. Store the new connection with g.conns.Store(addr, conn) only after the successful swap so it replaces the old one without leaking the old pool.

In `@internal/net/grpc/pool/pool_bench_test.go`:
- Line 247: Add the required placeholder by inserting a single line containing
"// NOT IMPLEMENTED BELOW" on its own line in pool_bench_test.go immediately
before the file end (i.e., after the last implemented test / before the final
closing brace or trailing newline) so the test file includes the delimiter
separating implemented tests from unimplemented ones.

---

Duplicate comments:
In `@internal/net/grpc/benchmark_test.go`:
- Around lines 43-50: The test has a TOCTOU race: it listens on "localhost:0", closes the listener, then re-binds later in startBroadcastServer. Change startBroadcastServer to accept a pre-bound net.Listener (or create and use a bufconn listener) so the server reuses the same listener instead of re-binding, or let the server bind to port 0 itself and retrieve the actual port from its own listener. Update call sites that currently pass a string port to provide the listener (or adapt to the server-returned address), and keep references to startBroadcastServer and any errgroup usage consistent.
- Around lines 110-126: The retry loop for connecting to agents spins tightly and uses the stdlib sync.WaitGroup. Replace it with Vald's internal sync utilities (e.g., internal/sync or internal/sync/errgroup) and add a short backoff between retries to avoid busy looping: in the goroutine that calls client.Connect(ctx, a) for each addr in addrs, use the internal errgroup or waitgroup equivalent instead of wg, and sleep for a few milliseconds after each failed attempt before retrying. Ensure connectedCount (atomic.AddInt64) is still incremented on success and that the new sync helper is used to wait for all goroutines to finish.

In `@internal/net/grpc/option.go`:
- Around lines 42-44: The defaults set in defaultOptions currently use WithKeepaliveParams("30s", "10s", true), which enables PermitWithoutStream and can trigger server GOAWAY when servers enforce min_time/permit_without_stream. Disable PermitWithoutStream by default by setting the third argument to false, i.e., WithKeepaliveParams("30s", "10s", false). Update any related comments and leave WithConnectionPoolSize(10) unchanged.
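For reference, assuming Vald's WithKeepaliveParams option maps onto the stock gRPC client API, the safer default corresponds roughly to this keepalive.ClientParameters configuration (a sketch of the intended settings, not Vald's actual option wiring):

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Safer keepalive default: pings are only sent while streams are active,
// so servers enforcing a keepalive policy won't answer with GOAWAY
// (ENHANCE_YOUR_CALM) on idle connections.
var opt = grpc.WithKeepaliveParams(keepalive.ClientParameters{
	Time:                30 * time.Second, // ping after 30s of inactivity
	Timeout:             10 * time.Second, // wait 10s for the ping ack
	PermitWithoutStream: false,            // do not ping idle connections
})
```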

In `@internal/net/grpc/pool/pool_bench_test.go`:
- Around lines 211-216: Remove the development scratch comments inside BenchmarkPool_HighContention: delete the multi-line notes about the mock server, dialer, and ListenAndServe so the benchmark contains only actual setup and benchmark code. Keep the BenchmarkPool_HighContention signature and any real setup/helper calls; remove only the commented-out planning text.
- Around lines 237-246: The benchmark measures setup time and allocations because its timers are not bracketed. Wrap setup with b.StopTimer()/b.StartTimer() like the other benchmarks so only the p.Do loop is measured: call b.StopTimer() before any setup performed prior to b.RunParallel, do the setup, then call b.StartTimer() immediately before the b.RunParallel that invokes p.Do and ClientConn. Keep b.ResetTimer() where appropriate to clear prior measurements.

In `@internal/net/grpc/pool/pool.go`:
- Around lines 587-599: Remove the informal development-scratch comments in the dial logic around p.isHealthy, NewClient, and the state handling (the multi-line notes about Idle, TransientFailure, Shutdown, and "we might want to keep it"). Replace them with a short, descriptive comment (or none) that states the intent clearly, for example: "If the connection is in TransientFailure, retain it and let it retry; otherwise close it." Keep conn.GetState(), connectivity.TransientFailure, and p.isHealthy(idx, conn) functionally untouched; only clean up the explanatory comments.

@kmrmt
Contributor

kmrmt commented Feb 19, 2026

/format

@vdaas-ci
Collaborator

[FORMAT] Updating license headers and formatting go codes triggered by kmrmt.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch from b1a485a to 59b91f6 Compare February 19, 2026 21:56
@kpango
Collaborator Author

kpango commented Mar 16, 2026

@copilot Can you fix it as shown below?

Great suggestion! Adding a reproducible CI container (custom runner image) ensures your build environment is consistent and includes all required tools (like gzip/pigz), reducing platform-related errors. Here’s how you can do it for your workflow in vdaas/vald:


1. Create a Dockerfile for your CI container

Place this file under .github/ci/Dockerfile (or wherever suits your repo):

FROM ubuntu:22.04

# Install required build tools, then clean up the apt cache to keep the image small
RUN apt-get update && \
    apt-get install -y \
        build-essential \
        git \
        pigz \
        curl \
        ca-certificates \
        docker.io \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

Adapt the list based on what your build needs (e.g., other language runtimes/tools).


2. Update your GitHub Actions workflow

In .github/workflows/dockers-agent-ngt-image.yaml, use the custom container for your CI job:

jobs:
  build-agent-ngt:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/vdaas/vald/ci-agent:latest
      options: --privileged # If you need nested Docker
    steps:
      - uses: actions/checkout@v4
      - name: Prune Docker Build Cache
        run: docker builder prune --force
      - name: Build Agent NGT Image
        run: docker build --no-cache -t agent-ngt .

3. Build and push your CI container to GitHub Container Registry

  1. Build the CI image locally:

    docker build -t ghcr.io/vdaas/vald/ci-agent:latest -f .github/ci/Dockerfile .
  2. Login and push:

    docker login ghcr.io
    docker push ghcr.io/vdaas/vald/ci-agent:latest

4. Fixing the build errors

With this dedicated CI container, you ensure tools like pigz, gzip, docker, etc. are always available and up-to-date. If you need specific versions or additional tools, just add them to your Dockerfile.


Summary:
Adding a custom CI container solves issues caused by missing system utilities and inconsistent environments. Update your workflow to reference this container, and your builds will be stable, reproducible, and tool-complete.

Let me know if you want a PR to add these files and workflow updates!

Contributor

Copilot AI commented Mar 16, 2026

@kpango I've opened a new pull request, #3503, to work on those changes. Once the pull request is ready, I'll request review from you.

@CLAassistant

CLAassistant commented Mar 16, 2026

CLA assistant check
All committers have signed the CLA.

@kpango kpango force-pushed the fix-grpc-broadcast-latency-17105619334351837761 branch 5 times, most recently from 6582777 to a2a25af Compare March 23, 2026 18:53
@kpango
Collaborator Author

kpango commented Mar 31, 2026

/format

@vdaas-ci
Collaborator

[FORMAT] Updating license headers and formatting go codes triggered by kpango.

Contributor

Copilot AI left a comment


Pull request overview

This PR updates CI/dependency/tooling versions and introduces several gRPC connection-pool and server shutdown changes aimed at reducing broadcast latency, improving pool behavior, and adding optional pool metrics/config knobs.

Changes:

  • Bump pinned GitHub Actions/tool versions and environment version references (Go/Rust/Docker/K8s/Helm/etc.).
  • Update gRPC client/pool behavior (pool sizing defaults, keepalive defaults, async refresh, metrics toggle) and improve server shutdown to close listeners.
  • Add/extend tests and benchmarks (gRPC pool/getHealthyConn, broadcast benchmarks, cgroup stats determinism) and introduce Rust meta config loading/CLI args.

Reviewed changes

Copilot reviewed 140 out of 148 changed files in this pull request and generated no comments.

Show a summary per file
File Description
versions/actions/SOFTPROPS_ACTION_GH_RELEASE Bump pinned GitHub Action version.
versions/actions/SHOGO82148_ACTIONS_UPLOAD_RELEASE_ASSET Bump pinned GitHub Action version.
versions/actions/GITHUB_ISSUE_METRICS Bump pinned GitHub Action version.
versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF Bump pinned GitHub Action version.
versions/actions/GITHUB_CODEQL_ACTION_INIT Bump pinned GitHub Action version.
versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD Bump pinned GitHub Action version.
versions/actions/GITHUB_CODEQL_ACTION_ANALYZE Bump pinned GitHub Action version.
versions/actions/DOCKER_SETUP_BUILDX_ACTION Bump pinned GitHub Action version.
versions/actions/DOCKER_LOGIN_ACTION Bump pinned GitHub Action version.
versions/actions/CRAZY_MAX_GHACTION_IMPORT_GPG Bump pinned GitHub Action version.
versions/actions/CODECOV_CODECOV_ACTION Bump pinned GitHub Action version.
versions/actions/AZURE_SETUP_HELM Bump pinned GitHub Action version.
versions/actions/ACTIONS_UPLOAD_ARTIFACT Bump pinned GitHub Action version.
versions/actions/ACTIONS_SETUP_GO Bump pinned GitHub Action version.
versions/actions/ACTIONS_DOWNLOAD_ARTIFACT Bump pinned GitHub Action version.
versions/actions/ACTIONS_CACHE Bump pinned GitHub Action version.
versions/ZLIB_VERSION Update bundled tool/library version pin.
versions/YQ_VERSION Update tool version pin.
versions/TELEPRESENCE_VERSION Update tool version pin.
versions/RUST_VERSION Update Rust version pin.
versions/PROTOBUF_VERSION Update tool version pin.
versions/PROMETHEUS_STACK_VERSION Update Helm chart version pin.
versions/NGT_VERSION Update NGT version pin.
versions/KUBELINTER_VERSION Update tool version pin.
versions/KUBECTL_VERSION Update tool version pin.
versions/K3S_VERSION Update tool version pin.
versions/HELM_VERSION Update tool version pin.
versions/HDF5_VERSION Update library version pin.
versions/GO_VERSION Update Go version pin.
versions/GOLANGCILINT_VERSION Update linter version pin.
versions/FAISS_VERSION Update Faiss version pin.
versions/DOCKER_VERSION Update Docker version pin.
versions/CMAKE_VERSION Update CMake version pin.
versions/CHAOS_MESH_VERSION Update Chaos Mesh version pin.
versions/BUF_VERSION Update Buf version pin.
rust/rust-toolchain.toml Bump Rust toolchain channel.
rust/libs/vqueue/Cargo.toml Bump Rust dependency versions.
rust/libs/observability/Cargo.toml Bump Rust dependency versions.
rust/libs/kvs/Cargo.toml Bump Rust dependency versions.
rust/libs/algorithms/qbg/build.rs Add ThinLTO flag to C++ build step.
rust/libs/algorithms/ngt/build.rs Add ThinLTO flag to C++ build step.
rust/bin/meta/src/main.rs Add CLI args + config loading/validation for meta binary.
rust/bin/meta/src/config.rs New YAML config + validation + observability conversion.
rust/bin/meta/Cargo.toml Add config/CLI deps and dev-deps for tests.
rust/bin/agent/Cargo.toml Bump Rust dependency versions.
rust/Cargo.toml Enable ThinLTO + codegen-units tuning for release builds.
pkg/agent/core/ngt/handler/grpc/insert.go Prefer Vector.Timestamp over Config.Timestamp for inserts.
pkg/agent/core/ngt/handler/grpc/index_test.go Update tests to account for timestamp migration.
k8s/operator/helm/crds/valdrelease.yaml Add enable_metrics schema field for connection pool.
k8s/manager/index/deployment.yaml Update config checksum annotation.
k8s/manager/index/configmap.yaml Add enable_metrics to connection_pool config.
k8s/index/job/save/configmap.yaml Add enable_metrics to connection_pool config.
k8s/index/job/deletion/configmap.yaml Add enable_metrics to connection_pool config.
k8s/index/job/creation/configmap.yaml Add enable_metrics to connection_pool config.
k8s/index/job/correction/configmap.yaml Add enable_metrics to connection_pool config.
k8s/gateway/gateway/mirror/deployment.yaml Update config checksum annotation.
k8s/gateway/gateway/mirror/configmap.yaml Add enable_metrics to connection_pool config.
k8s/gateway/gateway/lb/deployment.yaml Update config checksum annotation.
k8s/gateway/gateway/lb/configmap.yaml Add enable_metrics to connection_pool config.
internal/servers/server/server.go Track listener and close it during Shutdown.
internal/net/grpc/stats/stats_test.go Make cgroup tests deterministic via temp dirs; avoid parallel.
internal/net/grpc/stats/stats.go Add proc path override + better cgroup v2 path detection.
internal/net/grpc/server_test.go Adjust comparison options for unexported fields.
internal/net/grpc/pool/pool_bench_test.go Add high-contention pool benchmark.
internal/net/grpc/pool/pool_bench_gethealthy_test.go Add getHealthyConn benchmarks (serial + parallel).
internal/net/grpc/pool/pool.go Rework pool selection/refresh/health + add metrics toggle.
internal/net/grpc/pool/option.go Add pool option to enable/disable metrics.
internal/net/grpc/option.go Change default pool size and add default keepalive params + metrics option.
internal/net/grpc/errdetails/errdetails_test.go Add comprehensive tests for detail decoding/serialization.
internal/net/grpc/client.go Add singleflight around connect/disconnect + pool metrics wiring.
internal/net/grpc/benchmark_test.go New benchmarks for broadcast and RPC execution paths.
internal/core/algorithm/ngt/doc.go Update NGT upstream link.
internal/core/algorithm/faiss/Capi.cpp Add missing include (sstream).
internal/config/grpc_test.go Update expected options due to new connection-pool metrics option.
internal/config/grpc.go Add connection_pool.enable_metrics config and wire to grpc options.
hack/helm/schema/crd/main.go Optimize indent function string building.
hack/go.mod.default Update Go directive version.
hack/docker/gen/main.go Add clang+lld LTO path and adjust deps/packages.
hack/benchmark/internal/core/algorithm/ngt/ngt.go Update NGT search call signature.
example/client/go.sum Update dependency checksums.
example/client/go.mod.default Update Go directive version.
example/client/go.mod Update Go directive + dependency versions.
docs/user-guides/configuration.md Update NGT upstream links.
docs/usecase/usage-example.md Update NGT upstream links.
docs/tutorial/vald-agent-standalone-on-k8s.md Update NGT upstream links.
docs/performance/tuning-search-performance.md Update NGT upstream links in tables.
docs/overview/component/agent.md Update NGT upstream links.
docs/overview/component/README.md Update NGT upstream links.
docs/overview/architecture.md Update NGT upstream links.
docs/overview/about-vald.md Update NGT upstream links.
dockers/dev/Dockerfile Swap pkg-config->pkgconf; add pigz.
dockers/agent/core/agent/Dockerfile Add clang+lld, RUSTFLAGS, and ThinLTO build path.
cmd/tools/cli/benchmark/core/doc.go Update NGT upstream link.
charts/vald/values.yaml Add enable_metrics field to connection_pool defaults.
charts/vald/values.schema.json Add schema for enable_metrics and update NGT links.
charts/vald-with-tikv/values.yaml Update NGT links and adjust filename_suffix quoting.
charts/vald-helm-operator/crds/valdrelease.yaml Add enable_metrics schema field for connection pool.
README.md Update NGT upstream link.
Makefile.d/tools.mk Fix rustfmt install dependency ordering.
Makefile.d/docker.mk Build docker images with --network host.
Makefile.d/dependencies.mk Update NGT release lookup repo URL.
Makefile Update NGT repo and add NGT_EXTRA_CMAKE_FLAGS; run go fix in format target.
.gitignore Ignore generated /gen directory.
.github/workflows/update-pull-request-and-issue-template.yaml Bump action versions.
.github/workflows/update-deps.yaml Bump action versions.
.github/workflows/reviewdog-markdown.yaml Exclude an additional README from reviewdog patterns.
.github/workflows/release.yaml Bump action versions.
.github/workflows/issue-metrics.yaml Bump action versions.
.github/workflows/helm.yaml Bump action versions.
.github/workflows/github-actions-workflow-linter.yaml Bump setup-go action version.
.github/workflows/format.yaml Bump action versions.
.github/workflows/e2e-profiling.yaml Bump artifact/cache actions.
.github/workflows/e2e-code-bench-agent.yaml Bump artifact action version.
.github/workflows/dockers-readreplica-rotate-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-manager-index-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-save-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-operator-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-exportation-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-deletion-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-creation-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-index-correction-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-gateway-mirror-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-gateway-lb-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-gateway-filter-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-benchmark-operator-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-benchmark-job-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-agent-sidecar-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/dockers-agent-faiss-image.yaml Add singleflight to docker build trigger paths.
.github/workflows/coverage.yaml Bump codecov action version.
.github/workflows/codeql-analysis.yml Bump codeql actions.
.github/workflows/chatops.yaml Bump gpg import action version.
.github/workflows/build-binaries.yaml Bump artifact/download actions and upload-release-asset action.
.github/workflows/backport.yaml Bump gpg import action version.
.github/workflows/_release-pr.yaml Bump gpg import action version.
.github/workflows/_docker-image.yaml Bump docker login & buildx actions.
.github/actions/setup-helm/action.yaml Bump azure/setup-helm action version.
.github/actions/setup-go/action.yaml Bump actions/setup-go action version.
.github/actions/scan-docker-image/action.yaml Bump codeql upload-sarif action version.
.github/actions/e2e-profile/action.yaml Bump artifact/cache actions.
.github/PULL_REQUEST_TEMPLATE.md Update environment version hints.
.github/ISSUE_TEMPLATE/security_issue_report.md Update environment version hints.
.github/ISSUE_TEMPLATE/bug_report.md Update environment version hints.
.gitfiles Add new benchmark/config paths to file list.
Comments suppressed due to low confidence (9)

  • internal/servers/server/server.go: Shutdown collects listener close errors into rerr but always returns nil, which drops real shutdown failures. Return rerr (keeping it nil when there are no errors) so callers can detect shutdown problems.
  • internal/net/grpc/pool/pool.go: refreshConn now runs dials asynchronously and explicitly suppresses dial errors (return nil // do not propagate), which can make Connect() succeed even when all connections failed to dial. Consider tracking dial failures (e.g., an atomic failure counter or first-error capture) and have connect()/singleTargetConnect() return an error when no slot becomes usable within the connect call's context.
  • internal/net/grpc/pool/pool.go: TransientFailure is treated as healthy, which can cause IsHealthy() (and callers relying on it) to report a pool as healthy even when RPCs are likely to fail or be backed off. If you want to keep the conn for retry/backoff, keep it open, but consider reporting TransientFailure as unhealthy (or splitting 'usable' vs 'healthy' semantics).
  • internal/net/grpc/pool/pool.go: This loop iterates up to p.poolSize.Load() but indexes into slots with special-casing when idx >= slen (falling back to the last slot). If poolSize can ever exceed len(slots) during resize/update windows, this will repeatedly operate on the last slot and skew health/refresh behavior. A safer approach: compute limit := min(p.poolSize.Load(), uint64(len(slots))) (or ensure the slots length is always equal to the pool size) and iterate only limit elements.
  • internal/net/grpc/pool/pool_bench_gethealthy_test.go: These benchmarks create and connect a pool but never disconnect/close it. Add a cleanup (e.g., defer p.Disconnect(ctx) or equivalent) so repeated benchmark runs don't leak goroutines/connections and skew results.
  • internal/net/grpc/pool/pool_bench_test.go: BenchmarkPool_HighContention connects a pool but does not disconnect it. Please add cleanup to close connections after the benchmark to avoid leaking resources and contaminating subsequent benchmarks.
  • Makefile.d/docker.mk: Adding --network host to all local Docker builds increases risk (build steps can access host network services) and can reduce build reproducibility. Consider gating this behind an opt-in flag (e.g., DOCKER_BUILD_NETWORK ?= default, adding --network host only when explicitly requested).
  • hack/helm/schema/crd/main.go: This returns result.String() twice and slices using a length computed from a separate result.String() call. It's inefficient and can panic when the accumulated string is empty (len == 0). Store the string once in a local variable and only slice when the length > 0 (or avoid slicing by writing conditionally).
  • internal/net/grpc/pool/pool.go: The PR description emphasizes dependency/workflow updates, but this PR also includes substantial behavioral changes to gRPC pooling/client connection logic (async refresh, health semantics, defaults, metrics). Consider updating the PR description to explicitly call out these runtime behavior changes (and any rollout/compatibility notes) so reviewers/operators don't miss them.


@kpango
Collaborator Author

kpango commented Mar 31, 2026

/format

@vdaas-ci
Collaborator

[FORMAT] Updating license headers and formatting go codes triggered by kpango.

@kpango
Collaborator Author

kpango commented Mar 31, 2026

/format

@vdaas-ci
Collaborator

[FORMAT] Updating license headers and formatting go codes triggered by kpango.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 143 out of 166 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (11)

  • internal/servers/server/server.go: Shutdown accumulates errors into rerr but always returns nil. This hides listener-close failures (and any joined errors). Return rerr instead of nil so callers can observe shutdown errors.
  • internal/net/grpc/pool/pool.go: refreshConn swallows dial errors by returning nil from the goroutine even when p.dial fails and there is no healthy existing connection. This can cause Connect() to report success while the pool has no usable connections. Consider propagating dial errors (or at least returning an error when all refreshes fail or no healthy conns exist) so callers can react appropriately.
  • internal/net/grpc/pool/pool.go: refreshConn swallows dial errors by returning nil from the goroutine even when p.dial fails and there is no healthy existing connection. This can cause Connect() to report success while the pool has no usable connections. Consider propagating dial errors (or at least returning an error when all refreshes fail or no healthy conns exist) so callers can react appropriately.
  • internal/net/grpc/pool/pool.go: TransientFailure is treated as healthy here. In gRPC, TransientFailure generally indicates the connection is not ready for RPCs and will often immediately fail calls, which can increase observed latency. If the intent is to keep a failing conn around for backoff retries, consider distinguishing 'keep for retry' from 'healthy for use': return false for TransientFailure so Get/IsHealthy can avoid selecting it while still optionally triggering refresh/reconnect logic.
  • internal/net/grpc/stats/stats.go: subPath parsed from /proc/self/cgroup typically starts with / (e.g. /system.slice/...). If file.Join follows filepath.Join semantics, joining cgroupBasePath with an absolute subPath will discard cgroupBasePath and produce the wrong candidate path. Trim the leading slash (e.g. strings.TrimPrefix(subPath, "/")) before joining so the candidate resolves under cgroupBasePath.
  • hack/helm/schema/crd/main.go: This implementation repeatedly calls indent.String() and also calls result.String() twice on the return line, which creates avoidable allocations. Also, result.String()[:len(...)-1] is fragile and harder to read. Prefer computing indentStr := indent.String() once, and compute out := result.String() once before slicing/returning.
  • hack/helm/schema/crd/main.go: This implementation repeatedly calls indent.String() and also calls result.String() twice on the return line, which creates avoidable allocations. Also, result.String()[:len(...)-1] is fragile and harder to read. Prefer computing indentStr := indent.String() once, and compute out := result.String() once before slicing/returning.
  • hack/docker/gen/main.go: The RUSTFLAGS value here includes literal quotes ("..."). When exported into the container environment, those quotes may become part of the variable value and break rustc flag parsing. Store the raw flags without wrapping quotes so the environment variable is exactly -Clinker=clang -Clink-arg=-fuse-ld=lld.
  • Makefile.d/docker.mk: Adding --network host reduces build isolation (the Dockerfile build steps can access host services and networks) and can behave differently across environments/runtimes. Consider making this conditional (e.g., behind a Make variable defaulting to off) so users/CI can opt in only when necessary.
  • Makefile: go fix can perform code rewrites (not just formatting) and may introduce unexpected diffs when developers run make format/go. Consider moving go fix to a separate explicit target (or gating it behind a variable) to keep formatting deterministic and narrowly scoped.
  • Makefile: go fix can perform code rewrites (not just formatting) and may introduce unexpected diffs when developers run make format/go. Consider moving go fix to a separate explicit target (or gating it behind a variable) to keep formatting deterministic and narrowly scoped.


@kpango
Collaborator Author

kpango commented Apr 2, 2026

/format

@vdaas-ci
Collaborator

vdaas-ci commented Apr 2, 2026

[FORMAT] Updating license headers and formatting go codes triggered by kpango.

Signed-off-by: kpango <kpango@vdaas.org>
| | description | how does it affect? | Memo |
| :--------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------ |
| agent.ngt.creation_edge_size | Number of nodes connected to one node<BR>[ref: no_of_edges](https://github.com/NGT-labs/NGT/tree/main/bin/ngt#create) | It helps reduce unreachable edges.<BR>The larger it is, the denser the graph structure will be, but the memory usage, search speed, and index construction time will increase accordingly. | default value: `20` |
| agent.ngt.search_edge_size | Number of nodes to search from the origin node when searching<BR>[ref: no_of_edges_at_search_at](https://github.com/NGT-labs/NGT/tree/main/bin/ngt#create) | The number of nodes to search will increase.<BR>Accuracy will be higher, but speed will be lower.<BR>Adjust if adjusting the radius and epsilon does not improve the situation. | default value: `10` |
Contributor


[LanguageTool] reported by reviewdog 🐶
Did you mean “lowed”, “lowered”? (MUST_BE_DO[2])
Suggestions: lowed, lowered
Rule: https://community.languagetool.org/rule/show/MUST_BE_DO?lang=en-US&subId=2
Category: GRAMMAR

