fix(helmcharts/oracle): probe URL ?oracle_only=true -> ?http_only=true + timeout 1->3 by sam-at-luther · Pull Request #167 · luthersystems/mars

sam-at-luther · 2026-05-25T01:43:34Z

Problem

The oracle chart's liveness/readiness probes pass ?oracle_only=true to /v1/<app>/health_check. That query parameter does not exist on the proto — the field is http_only.

Proto: buf.build/.../healthcheck/v1/healthcheck.pb.go:80 — HttpOnly bool, no oracle_only field.
Handler gate: svc/oracle/oracle.go:302 — if !req.GetHttpOnly() { phylumHealthCheck(...) } — gate works, just never received the value.

grpc-gateway silently drops unknown query params, so the oracle handler ran the full health check including the ~1s shiroclient phylumHealthCheck call.
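
The failure mode above can be sketched in a few lines of Go. This is a simplified model rather than grpc-gateway's actual decoding: `HealthCheckRequest`, `decodeQuery`, and `runsPhylumCheck` are hypothetical names introduced for illustration, and only the `HttpOnly` field and the shape of the gate come from the description.

```go
package main

import (
	"fmt"
	"net/url"
)

// HealthCheckRequest is a hypothetical stand-in for the generated proto
// message; only the HttpOnly field is taken from the PR description.
type HealthCheckRequest struct {
	HttpOnly bool
}

// decodeQuery is a simplified model of a gateway that maps known query
// parameters onto request fields and ignores everything else.
func decodeQuery(rawQuery string) (HealthCheckRequest, error) {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return HealthCheckRequest{}, err
	}
	var req HealthCheckRequest
	// Only "http_only" is a known field. "oracle_only" is never looked up,
	// so HttpOnly keeps its zero value (false).
	if values.Get("http_only") == "true" {
		req.HttpOnly = true
	}
	return req, nil
}

// runsPhylumCheck mirrors the gate quoted above:
// if !req.GetHttpOnly() { phylumHealthCheck(...) }
func runsPhylumCheck(req HealthCheckRequest) bool {
	return !req.HttpOnly
}

func main() {
	for _, q := range []string{"oracle_only=true", "http_only=true"} {
		req, err := decodeQuery(q)
		if err != nil {
			panic(err)
		}
		fmt.Printf("?%s -> HttpOnly=%v, phylum check runs=%v\n",
			q, req.HttpOnly, runsPhylumCheck(req))
	}
}
```

In this model the old parameter leaves `HttpOnly` at false, so the slow path runs; only `?http_only=true` skips it.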

Symptom (staging, 2026-05-24)

6 umbrella-oracle restarts in 9 hours, ~every 90 minutes. Pod logs show:

rpc_dur=1.007s rpc_method=/srvpb.v1.UIService/GetHealthCheck
no phylum version found

The rpc_dur=1.007s blows the default 1s probe timeout. The no phylum version found line proves shiroclient was actually being called (i.e. ?oracle_only=true was a no-op). Kubelet marks pod unhealthy → restart → repeat.

Fix

Send the correct query param: ?http_only=true so the handler short-circuits past phylumHealthCheck.
Bump probe timeoutSeconds from the implicit 1s default to 3s — defence in depth so a GC pause doesn't bounce the pod even on the fast path.
Rename the values toggle oracleOnlyProbe -> httpOnlyProbe so chart and proto field share a name.

Diff

templates/deployment.yaml: ?oracle_only=true -> ?http_only=true on both probes; add timeoutSeconds: 3 to both; rename template var.
values.yaml: rename oracleOnlyProbe: true -> httpOnlyProbe: true.

grep -rn "oracleOnlyProbe\|oracle_only" . is now clean across the whole repo — only these two files referenced the old name.

Rollout

After merge + new mars tag, ui-infrastructure bumps .mars-version to pick this up. Tracking in ui-infrastructure#292.

… + timeout 1→3

The kubelet liveness/readiness probes were hitting the oracle health endpoint with `?oracle_only=true`, but that query parameter does not exist on the proto; the field is `http_only` (see buf.build/.../healthcheck/v1/healthcheck.pb.go:80 → `HttpOnly bool`, no `oracle_only` field).

grpc-gateway silently drops unknown query params, so the oracle handler received `http_only=false` and ran the full health check, including the ~1s shiroclient `phylumHealthCheck` call (handler gate at svc/oracle/oracle.go:302: `if !req.GetHttpOnly() { phylumHealthCheck(...) }`; the gate works correctly, it just never received the value).

Symptom on staging 2026-05-24: 6 umbrella-oracle restarts in 9h, every ~90m. Pod logs show the smoking gun:

    rpc_dur=1.007s rpc_method=/srvpb.v1.UIService/GetHealthCheck
    no phylum version found   # proves shiroclient was actually called

That 1.007s comfortably blows the default 1s probe timeout, kubelet marks the pod unhealthy, restarts. Repeat.

Fix:

- Send the correct query param `?http_only=true` so the handler short-circuits to the in-process health check.
- Bump probe `timeoutSeconds` from the implicit 1s default to 3s as defence in depth: even the in-process path does some work, and a single GC pause shouldn't bounce the pod.
- Rename the values toggle `oracleOnlyProbe` → `httpOnlyProbe` so the chart and the proto field share a name.

sam-at-luther merged commit 0420bdd into main May 25, 2026
4 checks passed

sam-at-luther merged 1 commit into main from fix/oracle-probe-http-only
