fix(helmcharts/oracle): probe URL ?oracle_only=true -> ?http_only=true + timeout 1->3 #167
Merged
… + timeout 1→3
The kubelet liveness/readiness probes were hitting the oracle health
endpoint with `?oracle_only=true`, but that query parameter does not
exist on the proto — the field is `http_only` (see
buf.build/.../healthcheck/v1/healthcheck.pb.go:80 → `HttpOnly bool`,
no `oracle_only` field).
grpc-gateway silently drops unknown query params, so the oracle
handler received `http_only=false` and ran the full health check,
including the ~1s shiroclient `phylumHealthCheck` call (handler gate
at svc/oracle/oracle.go:302: `if !req.GetHttpOnly() { phylumHealthCheck(...) }`
— gate works correctly, just never received the value).
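
The failure mode above can be sketched in a few lines of Go. This is a simplified stand-in, not grpc-gateway's actual decoder or the oracle handler: `healthCheckRequest`, `decodeQuery` and `runsFullCheck` are hypothetical names, and only the `HttpOnly` field and the shape of the gate come from the source.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// healthCheckRequest is a hypothetical stand-in for the generated proto
// message; only the HttpOnly field is taken from the source.
type healthCheckRequest struct {
	HttpOnly bool
}

// decodeQuery mimics the behaviour described above: known parameters are
// mapped onto the request, unknown ones are dropped without an error.
func decodeQuery(rawQuery string) (healthCheckRequest, error) {
	var req healthCheckRequest
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return req, err
	}
	if v := values.Get("http_only"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return req, err
		}
		req.HttpOnly = b
	}
	return req, nil
}

// runsFullCheck mirrors the handler gate: the slow check runs unless
// HttpOnly is set on the request.
func runsFullCheck(req healthCheckRequest) bool {
	return !req.HttpOnly
}

func main() {
	for _, q := range []string{"oracle_only=true", "http_only=true"} {
		req, err := decodeQuery(q)
		if err != nil {
			panic(err)
		}
		fmt.Printf("?%s -> HttpOnly=%v, full check runs: %v\n", q, req.HttpOnly, runsFullCheck(req))
	}
}
```

With `?oracle_only=true` the request keeps its zero value, so the gate behaves as if no parameter had been sent at all.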
Symptom on staging 2026-05-24: 6 umbrella-oracle restarts in 9h, every
~90m. Pod logs show the smoking gun:
    rpc_dur=1.007s rpc_method=/srvpb.v1.UIService/GetHealthCheck
    no phylum version found    # proves shiroclient was actually called
That 1.007s narrowly exceeds the default 1s probe timeout, so the kubelet
marks the pod unhealthy and restarts it. Repeat.
Fix:
- Send the correct query param `?http_only=true` so the handler short-
circuits to the in-process health check.
- Bump probe `timeoutSeconds` from the implicit 1s default to 3s as
defence in depth — even the in-process path does some work, and a
single GC pause shouldn't bounce the pod.
- Rename the values toggle `oracleOnlyProbe` → `httpOnlyProbe` so the
chart and the proto field share a name.
Problem
The oracle chart's liveness/readiness probes pass `?oracle_only=true` to `/v1/<app>/health_check`. That query parameter does not exist on the proto; the field is `http_only`.
- `buf.build/.../healthcheck/v1/healthcheck.pb.go:80`: `HttpOnly bool`, no `oracle_only` field.
- `svc/oracle/oracle.go:302`: `if !req.GetHttpOnly() { phylumHealthCheck(...) }`. The gate works, it just never received the value.

grpc-gateway silently drops unknown query params, so the oracle handler ran the full health check, including the ~1s shiroclient `phylumHealthCheck` call.

Symptom (staging, 2026-05-24)
6 umbrella-oracle restarts in 9 hours, roughly every 90 minutes. Pod logs show:

    rpc_dur=1.007s rpc_method=/srvpb.v1.UIService/GetHealthCheck
    no phylum version found

The `rpc_dur=1.007s` exceeds the default 1s probe timeout. The `no phylum version found` line proves shiroclient was actually being called (i.e. `?oracle_only=true` was a no-op). Kubelet marks pod unhealthy → restart → repeat.

Fix
- Send `?http_only=true` so the handler short-circuits past `phylumHealthCheck`.
- Bump `timeoutSeconds` from the implicit 1s default to 3s: defence in depth so a GC pause doesn't bounce the pod even on the fast path.
- Rename `oracleOnlyProbe` -> `httpOnlyProbe` so chart and proto field share a name.

Diff
- `templates/deployment.yaml`: `?oracle_only=true` -> `?http_only=true` on both probes; add `timeoutSeconds: 3` to both; rename template var.
- `values.yaml`: rename `oracleOnlyProbe: true` -> `httpOnlyProbe: true`.

`grep -rn "oracleOnlyProbe\|oracle_only" .` is now clean across the whole repo; only these two files referenced the old name.

Rollout
After merge + new mars tag, ui-infrastructure bumps `.mars-version` to pick this up. Tracking in ui-infrastructure#292.