
[ocp4_workload_rhacs] Pre-create CNAME to fix ACME DNS ENT issue on DDNS clusters #146

Open
prakhar1985 wants to merge 2 commits into main from fix-rhacs-ddns-cname-pre-cert
prakhar1985 commented May 3, 2026

Problem

On DDNS clusters (dyn.redhatworkshops.io), the central route always ends up with a self-signed certificate despite having ACME ClusterIssuers available (Google CA, ZeroSSL).

Root cause — DNS Empty Non-Terminal (ENT):

When cert-manager runs a DNS-01 ACME challenge for central-stackrox.apps.cluster-<guid>.dyn.redhatworkshops.io, it creates:

_acme-challenge.central-stackrox.apps.cluster-<guid>.dyn.redhatworkshops.io  TXT  "token"

This TXT record creates an ENT at central-stackrox.apps.cluster-<guid> in the DNS tree. Because the name now "exists", the wildcard *.apps.cluster-<guid> no longer matches it — the hostname returns NODATA instead of the router IP. The ACME challenge fails for all issuers (Google and ZeroSSL both use the same DDNS solver), and cert issuance falls back to selfsigned.

Fix

New task file dns_cname_pre_cert.yml pre-creates a specific CNAME record for the central route before cert issuance:

central-stackrox.apps.cluster-<guid>  CNAME  console-openshift-console.apps.cluster-<guid>.dyn.redhatworkshops.io.

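The CNAME gives central-stackrox.apps.cluster-<guid> a record of its own, so the name is no longer an empty non-terminal when the _acme-challenge TXT record is created beneath it. The hostname keeps resolving, the ACME challenge succeeds, and a trusted certificate is issued.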

Called from workload.yml immediately before certificate.yml.

Changes

roles/ocp4_workload_rhacs/tasks/dns_cname_pre_cert.yml (new file)

  • Checks cluster infrastructure platform — skips entirely on BareMetal and None (SNO) to avoid conflicting with the A record that dns_registration.yml creates after Central is deployed
  • Finds any ClusterIssuer with a DDNS webhook using json_query — no hardcoded issuer names
  • Silently skips if no DDNS-capable ClusterIssuer is found (AWS Route53, GCP, Azure)
  • Reads all TSIG config (tsigAlgorithm, tsigKeyName, tsigSecretRef, ddnsServer, ddnsZone) directly from the ClusterIssuer — nothing hardcoded
  • Creates the CNAME using community.general.nsupdate
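
The steps listed above could look roughly like the following. This is a hedged sketch, not the file from this PR: the variable example_guid, the _ddns fact, the cert-manager namespace for the TSIG secret, the {name, key} shape of tsigSecretRef, and the spec.acme.solvers[].dns01.webhook.config path are all assumptions made for illustration. It also assumes tsigAlgorithm already holds a value that community.general.nsupdate accepts, and that ddnsZone has no trailing dot.

```yaml
# Hypothetical sketch only; the real dns_cname_pre_cert.yml may differ.
- name: Read the cluster infrastructure platform
  kubernetes.core.k8s_info:
    api_version: config.openshift.io/v1
    kind: Infrastructure
    name: cluster
  register: r_infra

- name: Pre-create the central CNAME (skipped on BareMetal and None)
  when: r_infra.resources[0].status.platformStatus.type not in ['BareMetal', 'None']
  block:
    - name: List all ClusterIssuers
      kubernetes.core.k8s_info:
        api_version: cert-manager.io/v1
        kind: ClusterIssuer
      register: r_issuers

    # Assumed path: webhook solver config under spec.acme.solvers[].dns01.webhook.config
    - name: Keep the first webhook solver config that names a DDNS server
      ansible.builtin.set_fact:
        _ddns: >-
          {{ r_issuers.resources
             | community.general.json_query('[].spec.acme.solvers[].dns01.webhook.config')
             | selectattr('ddnsServer', 'defined')
             | list | first | default({}) }}

    - name: Create the CNAME only when a DDNS-capable issuer was found
      when: _ddns | length > 0
      block:
        - name: Read the TSIG secret referenced by the issuer
          kubernetes.core.k8s_info:
            api_version: v1
            kind: Secret
            name: "{{ _ddns.tsigSecretRef.name }}"
            namespace: cert-manager  # assumption: secret lives next to cert-manager
          register: r_tsig

        - name: Add the CNAME for the central route
          community.general.nsupdate:
            # nsupdate documents `server` as an IP address, so a hostname is resolved first.
            server: "{{ lookup('community.general.dig', _ddns.ddnsServer) }}"
            zone: "{{ _ddns.ddnsZone }}"
            key_name: "{{ _ddns.tsigKeyName }}"
            key_algorithm: "{{ _ddns.tsigAlgorithm }}"
            key_secret: "{{ r_tsig.resources[0].data[_ddns.tsigSecretRef.key] | b64decode }}"
            record: "central-stackrox.apps.cluster-{{ example_guid }}.{{ _ddns.ddnsZone }}."
            type: CNAME
            value: "console-openshift-console.apps.cluster-{{ example_guid }}.{{ _ddns.ddnsZone }}."
            ttl: 300
            state: present
```

Both community.general.nsupdate and the dig lookup need dnspython on the controller. If the DDNS server name resolves to more than one address, the lookup returns a comma-joined string and the sketch would need to pick one.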

roles/ocp4_workload_rhacs/tasks/workload.yml (modified)

  • Added 4 lines to include dns_cname_pre_cert.yml immediately before certificate.yml, gated on ocp4_workload_rhacs_enable_route_certs

Platform behaviour

Platform           Result
CNV / OcpSandbox   CNAME pre-created, ACME succeeds
AWS                No DDNS issuer found → block skipped
GCP / Azure        No DDNS issuer found → block skipped
BareMetal / SNO    Platform guard skips; dns_registration.yml creates A record after deploy

BareMetal clusters are explicitly excluded because dns_registration.yml already creates an A record for the same hostname after Central is deployed. A CNAME and an A record cannot coexist for the same name (RFC 1034).

Notes

  • dns_registration.yml is unchanged — BareMetal A record logic unaffected
  • Alternative to PR [ocp4_workload_rhacs] Add specific DNS for central route #136 — same problem solved, keeps DNS logic out of certificate.yml
  • TSIG creds are per-cluster not per-CA: both acme-bifrost-production-ddns and acme-bifrost-production-ddns-fallback share the same DDNS server, zone, and cert-manager-tsig-creds secret — the CNAME benefits whichever issuer runs the ACME challenge

Test plan

  • Provision an RHACS catalog item on a CNV/DDNS cluster — verify central route gets a trusted (non-selfsigned) cert
  • Provision on an AWS cluster — verify the CNAME task skips silently and provisioning completes normally
  • Provision on a BareMetal cluster — verify BareMetal A record logic in dns_registration.yml still works
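
For the CNV/DDNS case, a DNS-side check could complement the certificate check. The tasks below are a minimal sketch under assumptions: example_guid is a hypothetical variable, and the controller is assumed to query a resolver that sees the DDNS zone. As far as I know, the dig lookup returns an empty string for an empty answer and the literal NXDOMAIN for a missing name, which is what the assertion relies on.

```yaml
# Hypothetical post-provision check; not part of this PR.
- name: Look up the CNAME for the central route
  ansible.builtin.set_fact:
    _central_cname: >-
      {{ lookup('community.general.dig',
                'central-stackrox.apps.cluster-' ~ example_guid ~ '.dyn.redhatworkshops.io.',
                'qtype=CNAME') }}

- name: Fail if the CNAME is missing
  ansible.builtin.assert:
    that:
      - _central_cname not in ['', 'NXDOMAIN']
    fail_msg: "central route has no CNAME; the ACME challenge may hit the ENT problem"
```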

When cert-manager runs a DNS-01 ACME challenge for the central route,
it creates a _acme-challenge TXT record which produces a DNS Empty
Non-Terminal (ENT). The ENT causes the wildcard *.apps.* to stop
matching the central hostname, failing the ACME challenge for all
issuers (Google, ZeroSSL) and falling back to selfsigned.

Fix: add dns_cname_pre_cert.yml which pre-creates a specific CNAME
for the central route before cert issuance. The CNAME and the
_acme-challenge record coexist as siblings — no ENT problem.

- Issuer-agnostic: finds any ClusterIssuer with a DDNS webhook
  via json_query, no hardcoded issuer names
- Silently skips on non-DDNS clusters (AWS Route53, GCP, etc.)
- Reads tsigAlgorithm from ClusterIssuer config, nothing hardcoded
- Called from workload.yml before certificate.yml so the CNAME
  exists before any ACME challenge fires
- dns_registration.yml unchanged — BareMetal A record logic unaffected

- name: Pre-create CNAME record before ACME cert issuance
  when: ocp4_workload_rhacs_enable_route_certs | bool
  ansible.builtin.include_tasks: dns_cname_pre_cert.yml
@treddy08

Is this not doing the same thing in dns_registration.yml? We are already registering the CNAME.

@prakhar1985

Hey @treddy08 — not quite the same, three key differences:

  1. Record type: dns_registration.yml creates an A record (router IP). This task creates a CNAME pointing to an existing hostname in the zone. The CNAME is what prevents the ENT problem during the ACME challenge.

  2. Platform scope: dns_registration.yml is gated to BareMetal/None only (line 17). CNV clusters never enter that block. This new task runs on the other platforms and silently skips if no DDNS issuer is found.

  3. Execution order: dns_registration.yml is called at line 173 in workload.yml, after Central is deployed and routes exist. By that point cert issuance has already happened and failed. This new task runs before certificate.yml so the CNAME exists when cert-manager fires the ACME challenge.

dns_registration.yml also needs the live central route to exist (r_central_route_dns) to resolve the router IP — this task computes the hostname from variables so it can run before anything is deployed.

@prakhar1985

Also to address the concern about which issuer's credentials are used — the json_query picks the first DDNS-capable ClusterIssuer it finds, which could be acme-bifrost-production-ddns (Google) or acme-bifrost-production-ddns-fallback (ZeroSSL). But it doesn't matter which one is picked for two reasons:

TSIG credentials are per-cluster, not per-CA. Both issuers point to the same ddns01.infra.demo.redhat.com server, same zone, and same cert-manager-tsig-creds secret. Reading creds from either issuer gives the same result.

The CNAME is in the DNS zone — all issuers benefit from it. Once the CNAME exists it's just a DNS record. It doesn't know which CA later runs the ACME challenge. Whether cert-manager ends up using Google, ZeroSSL, or any future issuer, they all hit the same DNS zone and find the same CNAME — the ENT problem is gone for all of them.
