fix(chart): restore workflow-run NetworkPolicy rules for Postgres access #553

Draft
0xdiid wants to merge 1 commit into paradigmxyz:main from 0xSplits:fix/workflow-run-postgres-netpol

0xdiid commented Jun 13, 2026

Contributor

Heads-up: opening as a draft for maintainer review — flagging a NetworkPolicy regression I hit running the chart on k3s.

Summary

#521 ("remove legacy api and slackbot") rewrote contrib/chart/templates/networkpolicy.yaml and, as collateral, dropped every reference to the python workflow-host sandboxes (centaur.ai/component=workflow-run): both the Postgres-ingress from-entry and the standalone workflow-run egress policy.

These are not legacy plumbing — they are load-bearing for the current api-rs workflow engine:

  • services/api-rs/.../args.rs (workflow_host_spec) still labels every workflow-host sandbox centaur.ai/component=workflow-run and injects DATABASE_URL, so the host process connects to Postgres directly (its tool calls go through the per-sandbox iron-proxy; the host's own DB pool does not).
  • docs/.../operate/tailscale-funnel still says the NetworkPolicy "admits only the API, workflow-run pods, and the listed..." which describes the behavior that #521 removed.

Impact

On any cluster whose CNI enforces NetworkPolicy (k3s, Calico, Cilium), every workflow run now fails at host startup with:

ConnectionRefusedError: [Errno 111] Connect call failed (<postgres-svc>, 5432)

The default-deny policy blocks the host's egress to Postgres, and the Postgres ingress no longer admits it. kind's default CNI does not enforce NetworkPolicy, which is most likely why CI didn't catch it. Symptom in the wild: webhooks are accepted (202) and runs are created, but none ever execute. They pile up in pending while the claimed ones hang in running and clog the worker slots. No acks, no comments, nothing.

Fix

Restore the two rules, adapted to the post-#521 architecture (no resurrection of the removed legacy components):

  1. Postgres ingress — admit centaur.ai/component=workflow-run (gated on apiRs.enabled).
  2. workflow-run egress policy: Postgres :5432 + direct :443. The per-sandbox egress policy that api-rs renders dynamically (build_iron_proxy_network_policies) already covers the host's own proxy, the control plane, and DNS, so the dropped legacy api:8000 / slackbot:3001 / sandbox-id=api proxy peers are intentionally not reintroduced. Only the direct-Postgres and direct-HTTPS egress that the dynamic policy does not grant is restored.

Both are gated on apiRs.enabled (workflow hosts only exist there); the egress Postgres rule is additionally gated on postgres.enabled.
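
For reviewers who want the shape of the two rules without opening the diff, here is a rough sketch of what they could look like once rendered (apiRs.enabled=true, postgres.enabled=true). This is not the chart's actual output: the policy names and the Postgres pod label are hypothetical placeholders, and the real Postgres ingress also carries the other existing from-entries, omitted here. Only the centaur.ai/component=workflow-run label and the 5432/443 ports come from the description above. Leaving `to` off the :443 rule (any destination) is likewise an assumption about what "direct :443" means.

```yaml
# Sketch only. Names and the Postgres selector label are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-postgres-ingress          # hypothetical name
spec:
  podSelector:
    matchLabels:
      centaur.ai/component: postgres      # assumed label on the Postgres pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Restored from-entry: admit workflow-host sandboxes.
        - podSelector:
            matchLabels:
              centaur.ai/component: workflow-run
      ports:
        - protocol: TCP
          port: 5432
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-workflow-run-egress       # hypothetical name
spec:
  podSelector:
    matchLabels:
      centaur.ai/component: workflow-run
  policyTypes:
    - Egress
  egress:
    # Direct Postgres access for the host's own DB pool
    # (this rule would be dropped when postgres.enabled=false).
    - to:
        - podSelector:
            matchLabels:
              centaur.ai/component: postgres   # assumed label, as above
      ports:
        - protocol: TCP
          port: 5432
    # Direct HTTPS; no `to` block means any destination (assumption).
    - ports:
        - protocol: TCP
          port: 443
```

Because NetworkPolicy rules are additive, a standalone egress policy like the second one composes with the per-sandbox policy that api-rs renders rather than replacing it.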

Testing

  • helm template across apiRs.enabled × postgres.enabled combinations: rules render only when apiRs.enabled=true; the egress pg rule drops cleanly when postgres.enabled=false.
  • helm lint passes.
  • Verified live on a k3s cluster: applying the restored rules immediately unblocked Postgres connectivity and a ~200-run backlog drained with zero failures.

Happy to adjust — in particular whether the direct :443 egress is still wanted (it mirrors the old policy and the api-rs egress, but most workflow traffic now routes through the proxy).

paradigmxyz#521 (remove legacy api and slackbot) rewrote networkpolicy.yaml and, as
collateral, dropped every reference to the python workflow-host sandboxes
(centaur.ai/component=workflow-run) — both the postgres ingress from-entry
and the standalone workflow-run egress policy. These were not legacy: api-rs
still labels its workflow-host sandboxes workflow-run (args.rs) and injects
DATABASE_URL so the host dials Postgres directly, and the operate docs still
describe the netpol admitting workflow-run pods.

On any cluster whose CNI enforces NetworkPolicy (k3s, Calico, Cilium; kind's
default CNI does not, which is why CI didn't catch it) every workflow run now
fails at startup with ECONNREFUSED to Postgres — the default-deny blocks the
host's egress and Postgres ingress no longer admits it.

Restore the two rules, adapted to the post-paradigmxyz#521 architecture:
- postgres ingress: admit centaur.ai/component=workflow-run (gated on apiRs).
- a workflow-run egress policy: Postgres:5432 + direct 443. The per-sandbox
  egress policy api-rs renders already covers the host's proxy/control-plane/
  DNS egress, so the dropped legacy api:8000 / slackbot:3001 / sandbox-id=api
  proxy peers are intentionally not reintroduced.

Both are gated on apiRs.enabled (workflow hosts only exist there); the egress
Postgres rule is additionally gated on postgres.enabled. Verified with helm
template across apiRs/postgres enable combinations and helm lint.