fix(chart): restore workflow-run NetworkPolicy rules for Postgres access by 0xdiid · Pull Request #553 · paradigmxyz/centaur

0xdiid · 2026-06-13T02:54:28Z

Heads-up: opening as a draft for maintainer review — flagging a NetworkPolicy regression I hit running the chart on k3s.

Summary

#521 ("remove legacy api and slackbot") rewrote contrib/chart/templates/networkpolicy.yaml and, as collateral, dropped every reference to the python workflow-host sandboxes (centaur.ai/component=workflow-run): both the Postgres-ingress from-entry and the standalone workflow-run egress policy.

These are not legacy plumbing — they are load-bearing for the current api-rs workflow engine:

services/api-rs/.../args.rs (workflow_host_spec) still labels every workflow-host sandbox centaur.ai/component=workflow-run and injects DATABASE_URL, so the host process connects to Postgres directly (its tool calls go through the per-sandbox iron-proxy; the host's own DB pool does not).
docs/.../operate/tailscale-funnel still says the NetworkPolicy "admits only the API, workflow-run pods, and the listed...", which describes the behavior that #521 ("remove legacy api and slackbot") removed.

Impact

On any cluster whose CNI enforces NetworkPolicy (k3s, Calico, Cilium), every workflow run now fails at host startup with:

ConnectionRefusedError: [Errno 111] Connect call failed (<postgres-svc>, 5432)

The default-deny policy blocks the host's egress to Postgres, and the Postgres ingress no longer admits it. kind's default CNI does not enforce NetworkPolicy, which is most likely why CI didn't catch it. Symptom in the wild: webhooks are accepted (202) and runs are created, but none ever execute. They pile up pending while the claimed ones hang running and clog the worker slots. No acks, no comments, nothing.
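For context, a default-deny policy in its generic Kubernetes form looks roughly like the sketch below. This is the standard pattern, not a copy of the chart's template, and the resource name is hypothetical. NetworkPolicies are additive, so a pod selected by a policy like this can only talk to Postgres if some other policy explicitly allows that egress, and Postgres must separately admit the ingress.

```yaml
# Generic default-deny sketch (hypothetical name; not copied from the chart).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-default-deny
spec:
  podSelector: {}        # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are listed, so all traffic is denied
  # unless another policy selecting the same pod allows it.
```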

Fix

Restore the two rules, adapted to the post-#521 architecture (no resurrection of the removed legacy components):

Postgres ingress — admit centaur.ai/component=workflow-run (gated on apiRs.enabled).
workflow-run egress policy: Postgres :5432 + direct :443. The per-sandbox egress policy api-rs renders dynamically (build_iron_proxy_network_policies) already covers the host's own proxy, the control plane, and DNS, so the dropped legacy api:8000 / slackbot:3001 / sandbox-id=api proxy peers are intentionally not reintroduced. Only the direct-Postgres and direct-HTTPS egress that the dynamic policy does not grant is restored.

Both are gated on apiRs.enabled (workflow hosts only exist there); the egress Postgres rule is additionally gated on postgres.enabled.
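As an illustration of what the two restored rules amount to, here is a rough sketch of the rendered output for apiRs.enabled=true and postgres.enabled=true. It is not the chart's actual template. The resource names and the Postgres pod label are hypothetical, the existing ingress peers are omitted, and the :443 rule is assumed to be unrestricted by destination. Only the centaur.ai/component=workflow-run label and the ports come from the description above.

```yaml
# Sketch only: names and the Postgres pod label are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-postgres-ingress          # hypothetical name
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: postgres   # hypothetical Postgres pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Existing peers (for example the API) are omitted from this sketch.
        - podSelector:
            matchLabels:
              centaur.ai/component: workflow-run
      ports:
        - protocol: TCP
          port: 5432
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-workflow-run-egress       # hypothetical name
spec:
  podSelector:
    matchLabels:
      centaur.ai/component: workflow-run
  policyTypes:
    - Egress
  egress:
    # Direct Postgres access; per the description, present only when postgres.enabled=true.
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/component: postgres   # hypothetical Postgres pod label
      ports:
        - protocol: TCP
          port: 5432
    # Direct HTTPS; assumed here to allow any destination.
    - ports:
        - protocol: TCP
          port: 443
```

Because policies are additive, this egress policy would combine with the per-sandbox policy that api-rs renders dynamically rather than replace it.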

Testing

helm template across apiRs.enabled × postgres.enabled combinations: rules render only when apiRs.enabled=true; the egress pg rule drops cleanly when postgres.enabled=false.
helm lint passes.
Verified live on a k3s cluster: applying the restored rules immediately unblocked Postgres connectivity and a ~200-run backlog drained with zero failures.

Happy to adjust — in particular whether the direct :443 egress is still wanted (it mirrors the old policy and the api-rs egress, but most workflow traffic now routes through the proxy).

paradigmxyz#521 (remove legacy api and slackbot) rewrote networkpolicy.yaml and, as collateral, dropped every reference to the python workflow-host sandboxes (centaur.ai/component=workflow-run): both the postgres ingress from-entry and the standalone workflow-run egress policy.

These were not legacy: api-rs still labels its workflow-host sandboxes workflow-run (args.rs) and injects DATABASE_URL so the host dials Postgres directly, and the operate docs still describe the netpol admitting workflow-run pods.

On any cluster whose CNI enforces NetworkPolicy (k3s, Calico, Cilium; kind's default CNI does not, which is why CI didn't catch it) every workflow run now fails at startup with ECONNREFUSED to Postgres: the default-deny blocks the host's egress and Postgres ingress no longer admits it.

Restore the two rules, adapted to the post-paradigmxyz#521 architecture:

- postgres ingress: admit centaur.ai/component=workflow-run (gated on apiRs).
- a workflow-run egress policy: Postgres:5432 + direct 443. The per-sandbox egress policy api-rs renders already covers the host's proxy/control-plane/DNS egress, so the dropped legacy api:8000 / slackbot:3001 / sandbox-id=api proxy peers are intentionally not reintroduced.

Both are gated on apiRs.enabled (workflow hosts only exist there); the egress Postgres rule is additionally gated on postgres.enabled.

Verified with helm template across apiRs/postgres enable combinations and helm lint.

0xdiid wants to merge 1 commit into paradigmxyz:main from 0xSplits:fix/workflow-run-postgres-netpol
