fix(helmcharts/oracle): probe URL ?oracle_only=true -> ?http_only=true + timeout 1->3 #167

Merged
sam-at-luther merged 1 commit into main from fix/oracle-probe-http-only on May 25, 2026

@sam-at-luther
Member

Problem

The oracle chart's liveness/readiness probes pass ?oracle_only=true to /v1/<app>/health_check. That query parameter does not exist on the proto — the field is http_only.

  • Proto: buf.build/.../healthcheck/v1/healthcheck.pb.go:80 → HttpOnly bool, no oracle_only field.
  • Handler gate: svc/oracle/oracle.go:302: if !req.GetHttpOnly() { phylumHealthCheck(...) }. The gate works; it just never received the value.

grpc-gateway silently drops unknown query params, so the oracle handler ran the full health check including the ~1s shiroclient phylumHealthCheck call.
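
To make the failure mode concrete, here is a minimal Go sketch of how an unknown query parameter ends up as a no-op. It is a simplified model, not grpc-gateway's implementation: `HealthCheckRequest`, `decodeQuery`, and `runsPhylumCheck` are hypothetical names, and only the `HttpOnly` field and the shape of the gate are taken from the description above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// HealthCheckRequest is a hypothetical stand-in for the generated proto
// message; only the HttpOnly field comes from the PR description.
type HealthCheckRequest struct {
	HttpOnly bool
}

// decodeQuery is a simplified model of query-parameter binding (an
// assumption, not grpc-gateway's code): known keys populate the request,
// unknown keys are ignored without an error.
func decodeQuery(rawQuery string) (HealthCheckRequest, error) {
	var req HealthCheckRequest
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return req, err
	}
	for key, vals := range values {
		if len(vals) == 0 {
			continue
		}
		switch key {
		case "http_only":
			b, err := strconv.ParseBool(vals[0])
			if err != nil {
				return req, err
			}
			req.HttpOnly = b
		default:
			// Unknown parameter (e.g. oracle_only): dropped silently.
		}
	}
	return req, nil
}

// runsPhylumCheck mirrors the gate quoted from svc/oracle/oracle.go:
// the expensive check runs unless HttpOnly is set.
func runsPhylumCheck(req HealthCheckRequest) bool {
	return !req.HttpOnly
}

func main() {
	for _, q := range []string{"oracle_only=true", "http_only=true"} {
		req, err := decodeQuery(q)
		if err != nil {
			panic(err)
		}
		fmt.Printf("?%s -> HttpOnly=%v, phylum check runs=%v\n",
			q, req.HttpOnly, runsPhylumCheck(req))
	}
}
```

With `?oracle_only=true` the request keeps its zero value, so the gate lets the expensive check run; with `?http_only=true` the gate skips it.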

Symptom (staging, 2026-05-24)

6 umbrella-oracle restarts in 9 hours, ~every 90 minutes. Pod logs show:

rpc_dur=1.007s rpc_method=/srvpb.v1.UIService/GetHealthCheck
no phylum version found

The rpc_dur=1.007s blows the default 1s probe timeout. The no phylum version found line proves shiroclient was actually being called (i.e. ?oracle_only=true was a no-op). Kubelet marks pod unhealthy → restart → repeat.
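
The timeout interaction can be sketched in Go as follows. This is an illustrative model of a probe with a deadline, not kubelet's code: `probe`, `slowCheck`, and the durations are hypothetical, and the durations are deliberately scaled down and spaced far apart so the example runs quickly and deterministically.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	checkDuration = 100 * time.Millisecond // stand-in for the slow health check
	shortTimeout  = 10 * time.Millisecond  // stand-in for a timeout the check exceeds
	longTimeout   = 5 * time.Second        // stand-in for a timeout with headroom
)

// slowCheck simulates a health-check handler that takes checkDuration.
func slowCheck() {
	time.Sleep(checkDuration)
}

// probe runs check and reports an error if it has not finished within
// timeout, roughly how a probe with timeoutSeconds behaves.
func probe(timeout time.Duration, check func()) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan struct{})
	go func() {
		check()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	for _, timeout := range []time.Duration{shortTimeout, longTimeout} {
		err := probe(timeout, slowCheck)
		if errors.Is(err, context.DeadlineExceeded) {
			fmt.Printf("timeout=%v: probe failed (deadline exceeded)\n", timeout)
			continue
		}
		fmt.Printf("timeout=%v: probe succeeded\n", timeout)
	}
}
```

The same handler passes or fails depending only on the deadline, which is why a response that lands just past the timeout is enough to fail the probe.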

Fix

  1. Send the correct query param: ?http_only=true so the handler short-circuits past phylumHealthCheck.
  2. Bump probe timeoutSeconds from the implicit 1s default to 3s — defence in depth so a GC pause doesn't bounce the pod even on the fast path.
  3. Rename the values toggle oracleOnlyProbe -> httpOnlyProbe so chart and proto field share a name.

Diff

  • templates/deployment.yaml: ?oracle_only=true -> ?http_only=true on both probes; add timeoutSeconds: 3 to both; rename template var.
  • values.yaml: rename oracleOnlyProbe: true -> httpOnlyProbe: true.
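
Helm templates are Go templates, so the shape of the change can be sketched with `text/template` from the standard library. The fragment below is hypothetical and is not the chart's actual `templates/deployment.yaml`: the `values` struct, the `Port` field, and the example app name are assumptions, and a real chart would read the toggle as `.Values.httpOnlyProbe`.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// probeTmpl is a hypothetical probe fragment, not the chart's real template.
const probeTmpl = `livenessProbe:
  httpGet:
    path: /v1/{{ .App }}/health_check{{ if .HttpOnlyProbe }}?http_only=true{{ end }}
    port: {{ .Port }}
  timeoutSeconds: {{ .TimeoutSeconds }}
`

// values stands in for the chart's values.yaml; field names are illustrative.
type values struct {
	App            string
	Port           int // assumed; the PR does not mention the port
	HttpOnlyProbe  bool
	TimeoutSeconds int
}

// render executes the template against v and returns the resulting YAML.
func render(v values) (string, error) {
	t, err := template.New("probe").Parse(probeTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, v); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render(values{
		App:            "myapp",
		Port:           8080,
		HttpOnlyProbe:  true,
		TimeoutSeconds: 3,
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Naming the toggle after the proto field keeps the rendered query parameter and the values key aligned, which is the point of the rename.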

grep -rn "oracleOnlyProbe\|oracle_only" . is now clean across the whole repo — only these two files referenced the old name.

Rollout

After merge + new mars tag, ui-infrastructure bumps .mars-version to pick this up. Tracking in ui-infrastructure#292.

… + timeout 1→3

The kubelet liveness/readiness probes were hitting the oracle health
endpoint with `?oracle_only=true`, but that query parameter does not
exist on the proto — the field is `http_only` (see
buf.build/.../healthcheck/v1/healthcheck.pb.go:80 → `HttpOnly bool`,
no `oracle_only` field).

grpc-gateway silently drops unknown query params, so the oracle
handler received `http_only=false` and ran the full health check,
including the ~1s shiroclient `phylumHealthCheck` call (handler gate
at svc/oracle/oracle.go:302: `if !req.GetHttpOnly() { phylumHealthCheck(...) }`
— gate works correctly, just never received the value).

Symptom on staging 2026-05-24: 6 umbrella-oracle restarts in 9h, every
~90m. Pod logs show the smoking gun:

  rpc_dur=1.007s rpc_method=/srvpb.v1.UIService/GetHealthCheck
  no phylum version found    # proves shiroclient was actually called

That 1.007s exceeds the default 1s probe timeout by 7ms, so kubelet
marks the pod unhealthy and restarts it. Repeat.

Fix:
- Send the correct query param `?http_only=true` so the handler short-
  circuits to the in-process health check.
- Bump probe `timeoutSeconds` from the implicit 1s default to 3s as
  defence in depth — even the in-process path does some work, and a
  single GC pause shouldn't bounce the pod.
- Rename the values toggle `oracleOnlyProbe` → `httpOnlyProbe` so the
  chart and the proto field share a name.
@sam-at-luther merged commit 0420bdd into main on May 25, 2026
4 checks passed