fix(chart): restore workflow-run NetworkPolicy rules for Postgres access #553
Draft
0xdiid wants to merge 1 commit into
paradigmxyz#521 (remove legacy api and slackbot) rewrote networkpolicy.yaml and, as collateral, dropped every reference to the python workflow-host sandboxes (centaur.ai/component=workflow-run): both the postgres ingress from-entry and the standalone workflow-run egress policy.

These were not legacy: api-rs still labels its workflow-host sandboxes workflow-run (args.rs) and injects DATABASE_URL so the host dials Postgres directly, and the operate docs still describe the netpol admitting workflow-run pods.

On any cluster whose CNI enforces NetworkPolicy (k3s, Calico, Cilium; kind's default CNI does not, which is why CI didn't catch it) every workflow run now fails at startup with ECONNREFUSED to Postgres: the default-deny blocks the host's egress and Postgres ingress no longer admits it.

Restore the two rules, adapted to the post-paradigmxyz#521 architecture:

- postgres ingress: admit centaur.ai/component=workflow-run (gated on apiRs).
- a workflow-run egress policy: Postgres:5432 + direct 443. The per-sandbox egress policy api-rs renders already covers the host's proxy/control-plane/DNS egress, so the dropped legacy api:8000 / slackbot:3001 / sandbox-id=api proxy peers are intentionally not reintroduced.

Both are gated on apiRs.enabled (workflow hosts only exist there); the egress Postgres rule is additionally gated on postgres.enabled.

Verified with helm template across apiRs/postgres enable combinations and helm lint.
Heads-up: opening as a draft for maintainer review — flagging a NetworkPolicy regression I hit running the chart on k3s.
Summary
#521 ("remove legacy api and slackbot") rewrote `contrib/chart/templates/networkpolicy.yaml` and, as collateral, dropped every reference to the python workflow-host sandboxes (`centaur.ai/component=workflow-run`): both the Postgres-ingress `from`-entry and the standalone `workflow-run` egress policy.

These are not legacy plumbing; they are load-bearing for the current api-rs workflow engine:

- `services/api-rs/.../args.rs` (`workflow_host_spec`) still labels every workflow-host sandbox `centaur.ai/component=workflow-run` and injects `DATABASE_URL`, so the host process connects to Postgres directly (its tool calls go through the per-sandbox iron-proxy; the host's own DB pool does not).
- `docs/.../operate/tailscale-funnel` still says the NetworkPolicy "admits only the API, workflow-run pods, and the listed...", which describes the behavior #521 removed.

Impact
On any cluster whose CNI enforces NetworkPolicy (k3s, Calico, Cilium), every workflow run now fails at host startup with `ECONNREFUSED` when connecting to Postgres.

The `default-deny` policy blocks the host's egress to Postgres, and the Postgres ingress no longer admits it. kind's default CNI does not enforce NetworkPolicy, which is most likely why CI didn't catch it.

Symptom in the wild: webhooks are accepted (202) and runs are created, but none ever execute. They pile up `pending` while the claimed ones hang `running` and clog the worker slots. No acks, no comments, nothing.

Fix
Restore the two rules, adapted to the post-#521 architecture (no resurrection of the removed legacy components):

- Postgres ingress: admit `centaur.ai/component=workflow-run` (gated on `apiRs.enabled`).
- `workflow-run` egress policy: Postgres `:5432` + direct `:443`. The per-sandbox egress policy api-rs renders dynamically (`build_iron_proxy_network_policies`) already covers the host's own proxy, the control plane, and DNS, so the dropped legacy `api:8000` / `slackbot:3001` / `sandbox-id=api` proxy peers are intentionally not reintroduced. Only the direct-Postgres and direct-HTTPS egress that the dynamic policy does not grant is restored.

Both are gated on `apiRs.enabled` (workflow hosts only exist there); the egress Postgres rule is additionally gated on `postgres.enabled`.

Testing
- `helm template` across `apiRs.enabled` × `postgres.enabled` combinations: rules render only when `apiRs.enabled=true`; the egress pg rule drops cleanly when `postgres.enabled=false`.
- `helm lint` passes.

Happy to adjust, in particular on whether the direct `:443` egress is still wanted (it mirrors the old policy and the api-rs egress, but most workflow traffic now routes through the proxy).
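
For reviewers who want a picture of the shape, the two restored rules described under Fix could look roughly like the sketch below. It is not copied from the diff. The policy names (`postgres-ingress`, `workflow-run-egress`) and the Postgres pod label (`centaur.ai/component: postgres`) are assumptions made for illustration, and the sketch assumes the workflow-run pods live in the same namespace as Postgres. Only the `workflow-run` label, the ports, and the `apiRs.enabled` / `postgres.enabled` gates come from the description.

```yaml
# Sketch only: names and the Postgres pod label are assumptions, not the chart's actual values.
---
# (1) Postgres ingress: one extra `from` entry admitting workflow-run pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-ingress                 # hypothetical name
spec:
  podSelector:
    matchLabels:
      centaur.ai/component: postgres     # assumed label for the Postgres pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Existing peers (for example the API) are unchanged and omitted from this sketch.
        {{- if .Values.apiRs.enabled }}
        - podSelector:
            matchLabels:
              centaur.ai/component: workflow-run
        {{- end }}
      ports:
        - protocol: TCP
          port: 5432
{{- if .Values.apiRs.enabled }}
---
# (2) Standalone egress policy for workflow-run pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: workflow-run-egress              # hypothetical name
spec:
  podSelector:
    matchLabels:
      centaur.ai/component: workflow-run
  policyTypes:
    - Egress
  egress:
    {{- if .Values.postgres.enabled }}
    # Direct Postgres access for the host's own DB pool.
    - to:
        - podSelector:
            matchLabels:
              centaur.ai/component: postgres   # assumed label
      ports:
        - protocol: TCP
          port: 5432
    {{- end }}
    # Direct HTTPS egress; with no `to` block the rule matches any destination on this port.
    - ports:
        - protocol: TCP
          port: 443
{{- end }}
```

Because NetworkPolicies are additive, a separate egress policy like (2) only widens what workflow-run pods may reach on top of the per-sandbox policy api-rs renders; it does not replace it.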