
feat: Azure Key Vault native Certificate-object storage mode #118

Closed

rocogamer wants to merge 41 commits into fabriziosalmi:main from rocogamer:feature/keyvault-certificate-objects

Conversation

Contributor

rocogamer commented May 7, 2026

Summary

Adds a configurable storage_mode to the Azure Key Vault storage backend so it can persist certificates as native Certificate objects (PKCS12, issuer_name="Unknown") in addition to — or instead of — the existing per-PEM Secrets layout. Native Certificate objects are consumed directly by App Service, Application Gateway, Front Door, API Management and AKS Ingress without manual PFX export.

  • storage_mode: secrets (default) — current per-PEM Secrets layout, fully backwards-compatible.
  • storage_mode: certificate — single Certificate object, metadata stored as tags.
  • storage_mode: both — write to both surfaces; reads prefer Secrets (cheaper).
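The write-side dispatch across the two surfaces can be sketched as follows (illustrative Python only; the function and parameter names are hypothetical, not the PR's actual implementation):

```python
# Hypothetical sketch of the three-mode write dispatch. Each surface runs
# under its own try/except, so an outage on one never aborts the other;
# True is returned only when every active surface succeeded.
def store(mode, write_secrets, write_certificate):
    ok = True
    if mode in ("secrets", "both"):
        try:
            ok = write_secrets() and ok
        except Exception:
            ok = False  # Secrets outage must not skip the Certificate write
    if mode in ("certificate", "both"):
        try:
            ok = write_certificate() and ok
        except Exception:
            ok = False
    return ok
```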

A new admin endpoint POST /api/storage/azure-keyvault/backfill-certificates imports Certificate objects for domains already stored as Secrets (skipping ones already imported); restricted to storage_mode='both' so the legacy Secrets are still listed during the walk. Exposed as a "Backfill Certificate objects" button in Settings → Storage when the backend mode allows it.
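The skip-already-imported walk behind the backfill endpoint amounts to the following (a minimal sketch with hypothetical helper names, assuming domains already stored as Secrets are compared against existing Certificate objects):

```python
# Illustrative backfill walk: import a Certificate object for every
# Secrets-stored domain that does not already have one, and report
# previously imported domains as skipped.
def backfill(secret_domains, existing_certificates, import_certificate):
    imported, skipped = [], []
    for domain in secret_domains:
        if domain in existing_certificates:
            skipped.append(domain)  # already imported on a previous run
        else:
            import_certificate(domain)
            imported.append(domain)
    return {"imported": imported, "skipped": skipped}
```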

Implementation notes

  • Composition: AzureKeyVaultBackend keeps auth/naming/retry; the new private _AzureKeyVaultCertificateImporter encapsulates Certificate-API specifics. Both Azure SDK clients are lazy so an install without azure-keyvault-certificates only fails when a Certificate-mode call is actually attempted.
  • Tags ↔ metadata round-trip lives in a single class-level helper with a strict allow-list, so vault-level tags from Azure Policy (Environment, CostCenter, …) do not pollute the rehydrated metadata. Truncated SAN lists drop the (incomplete-by-construction) trailing fragment to avoid exposing a malformed FQDN.
  • Surface independence: in both mode store_certificate and delete_certificate run each surface under its own try/except. A Secrets-API outage no longer aborts the Certificate-object write (and vice versa); the method returns False on partial failure so callers can react.
  • Pre-existing bug fixed in passing: _list_secret_domains filtered endswith('-metadata'), which never matched anything in production because _sanitize_secret_name always appends an 8-char CRC32 suffix. The filter is now anchored as ^cert-.+-metadata-[0-9a-f]{8}$. Without this, list_certificates() and the new backfill endpoint would have walked an empty list.
  • Settings migration: backfills storage_mode='secrets' for upgraded installs at load time so existing settings.json files keep working without manual edits.
  • Service Principal permissions: documented in docs/architecture.md. secrets mode is unchanged; certificate/both modes additionally require Certificates Get/List/Import/Delete and keep Secrets Get/List (Key Vault exposes the imported PFX, including the private key, only via the Secret with the same name as the Certificate object).
  • Sanitised error messages: both StorageBackendTest and the new backfill endpoint return only the exception type to the client; full Azure SDK detail (tenant ids, request ids, WWW-Authenticate headers) stays in logger.error.
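The anchored list-filter fix can be illustrated directly with the pattern quoted in the bug note (the secret names here are made-up examples):

```python
import re

# The corrected, anchored filter: metadata secrets are named
# cert-<domain>-metadata-<8-hex CRC32>, so the old endswith('-metadata')
# check never matched anything in production.
METADATA_SECRET = re.compile(r"^cert-.+-metadata-[0-9a-f]{8}$")

def is_metadata_secret(name):
    return METADATA_SECRET.match(name) is not None
```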
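The tags ↔ metadata allow-list behaves roughly like this (a sketch; the allowed keys shown are the domain/dns_provider/staging tags named in the test plan, and the helper name is illustrative):

```python
# Minimal sketch of the strict allow-list round-trip: vault-level policy
# tags such as Environment or CostCenter are dropped on rehydration so
# they never pollute the reconstructed metadata.
ALLOWED_TAGS = {"domain", "dns_provider", "staging"}

def tags_to_metadata(tags):
    return {k: v for k, v in tags.items() if k in ALLOWED_TAGS}
```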

Files changed

  • modules/core/storage_backends.py — new helper class, three modes, surface-independent store/delete, CRC-aware list filter.
  • modules/core/settings.py — nested-default migration for azure_keyvault.storage_mode.
  • modules/api/{models,resources}.py — storage_mode field on the model, new backfill endpoint, Certificate-API probe in StorageBackendTest.
  • templates/partials/settings_storage.html + static/js/settings.js — storage-mode selector and Backfill button (visible only when active mode is both).
  • requirements-azure-storage.txt, requirements-storage-all.txt — azure-keyvault-certificates>=4.7.0.
  • docs/architecture.md — three-mode table + Service Principal permissions matrix.
  • RELEASE_NOTES.md — Unreleased entry.
  • tests/test_azure_keyvault_certificate_storage.py — 38 new unit tests with mocked SDK clients.

Test plan

  • pytest tests/test_azure_keyvault_certificate_storage.py -v — 38/38 pass (2 skip cleanly when azure-keyvault-certificates is not installed).
  • pytest tests/ -m unit -v — full unit suite, 0 regressions.
  • AST parse on every modified Python file.
  • Backwards-compat: load a settings.json without storage_mode, confirm it migrates to secrets silently.
  • Smoke test against a real Azure Key Vault (requires a Service Principal with Certificates + Secrets permissions):
    1. Configure SP in UI with storage_mode=both; press Test connection — both Secrets and Certificate API access verified.
    2. Issue a new cert from the dashboard — confirm in Azure Portal that Secrets cert-…-{cert,chain,fullchain,privkey}-pem-{crc} AND a cert-{domain}-{crc} Certificate object (with domain/dns_provider/staging tags) both appear.
    3. Renew the cert — confirm a new version of the same Certificate object (not a new object).
    4. Switch to storage_mode=certificate; issue a new cert — only the Certificate object is created, no Secrets.
    5. GET /api/certificates/{domain} — returns the four PEMs reconstructed from the PFX export.
    6. DELETE /api/certificates/{domain} — both surfaces removed in both mode.
    7. Switch back to both with legacy Secrets-only domains, press Backfill Certificate objects — every legacy domain gets an imported Certificate object; already-imported ones are reported as skipped.
    8. Bind the Certificate object from an Azure App Service / Application Gateway with managed identity — TLS works without manual PFX export.
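The CRC-suffixed names checked in step 2 come from the sanitiser mentioned in the bug note; a hypothetical reimplementation (the real _sanitize_secret_name may differ in detail) looks like:

```python
import re
import zlib

# Hypothetical sketch of a CRC-suffixed sanitiser producing names of the
# cert-...-{crc} shape: illegal characters become '-' (Key Vault secret
# names allow only [A-Za-z0-9-]) and an 8-char hex CRC32 of the original
# input is appended so distinct inputs cannot collide after sanitising.
def sanitize_secret_name(raw):
    crc = format(zlib.crc32(raw.encode()), "08x")
    body = re.sub(r"[^a-zA-Z0-9-]", "-", raw).strip("-")
    return f"{body}-{crc}"
```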

Notes for reviewers

  • Default storage_mode='secrets' means existing installs see no behaviour change after upgrade.
  • The azure-keyvault-certificates SDK is in the optional storage extras (requirements-azure-storage.txt); CI does not install it by default, so the two SDK-touching tests skip cleanly under pytest.importorskip.
  • The _with_retry decorator is currently shadowed by inner try/except blocks across all Azure KV methods (and other backends). This is pre-existing in main and out of scope here, but worth a follow-up issue.

🤖 Generated with Claude Code

imartinezgr and others added 30 commits May 7, 2026 13:52
Adds a configurable storage_mode to the Azure Key Vault backend so it can
persist certificates as native Certificate objects (PKCS12, issuer
'Unknown') in addition to — or instead of — the existing per-PEM Secrets
layout. Native Certificate objects are consumed directly by App Service,
Application Gateway, Front Door, API Management and AKS Ingress without
manual PFX export.

Modes (default 'secrets' preserves existing behaviour):
  secrets     — current per-PEM Secrets layout, unchanged
  certificate — single Certificate object, metadata stored as tags
  both        — write to both surfaces, prefer Secrets on read

The new POST /api/storage/azure-keyvault/backfill-certificates admin
endpoint imports Certificate objects for domains already stored as
Secrets (skipping ones already imported); restricted to storage_mode=
'both' so the legacy Secrets are still listed during the walk.

Implementation notes:
  - Composition: AzureKeyVaultBackend keeps auth/naming/retry; the new
    private _AzureKeyVaultCertificateImporter encapsulates Certificate-
    API specifics. Both clients are lazy.
  - Tags ↔ metadata round-trip lives in a single class-level helper with
    a strict allow-list, so vault-level tags from Azure Policy do not
    pollute the rehydrated metadata.
  - Truncated SAN lists drop the (incomplete-by-construction) trailing
    fragment on rehydrate to avoid exposing a malformed FQDN.
  - The metadata-secret list filter is anchored to the CRC32 suffix that
    _sanitize_secret_name appends, fixing a pre-existing bug in main
    where endswith('-metadata') matched zero secrets in production.
  - Settings migration backfills storage_mode='secrets' for upgraded
    installs.
  - StorageBackendTest probes the Certificate API explicitly when the
    target mode requires it; failure messages expose only the exception
    type to the client (full detail in server logs).

Tests: 32 unit tests with mocked SDK clients cover all three modes,
backfill flow pre-conditions, retrieve fallback paths in 'both' mode,
SAN truncation edge cases, settings migration and external-tag
filtering. Suite of 58 unit tests passes with no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The docstring previously said "rehydrated from the Certificate object's
tags". The actual code reads secret.properties.tags from the companion
Secret — Azure mirrors them, so the behaviour matches the description in
practice, but the docstring promised something the code doesn't literally
do. Spelling out the mirroring (and the deliberate read-from-Certificate
in the 'both' fallback) makes the contract auditable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The backfill endpoint was returning f'error: {per_domain}' to the client,
which exposes the full Azure exception text (tenant id, request id,
WWW-Authenticate headers). The companion StorageBackendTest endpoint
already only returns the exception type. Match the pattern here so an
admin-only surface still doesn't leak SDK internals into the response
body — full detail stays in the server log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The backend rejects backfill unless storage_mode='both' (in 'certificate'
mode list_certificates() does not surface the legacy Secrets so the walk
would silently no-op). Match the UI to that contract instead of letting
the user click into a 400 — the original heuristic toggled visibility
when mode != 'secrets', which falsely advertised the action in
certificate-only mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The sanitiser never touched self — it's a pure transformation of its name
argument. Marking it @staticmethod lets callers (including the importer's
``sanitize_name`` callback wiring and the test helper) reach it without a
backend instance, removing the AzureKeyVaultBackend.__new__(...) trick
the round-trip test was using to call it during fixture setup.

The bound-method-as-callback wiring at _get_cert_importer() keeps working
because Python returns the underlying function for staticmethods accessed
via an instance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In 'both' mode the Secrets and Certificate surfaces are independent and a
write to one should not block the other. The previous flow caught a
Secrets-surface exception with the outer try/except, which short-
circuited the method before the Certificate-object branch ran. A return-
False from the same call (no exception) by contrast did fall through and
import the cert — asymmetric, surprising, and worse during a real outage
of the Secrets API since it would silently fail to update App
Gateway-bound certs that consume the Certificate object.

Each surface now runs under its own try/except. The method returns True
only when every active surface succeeded, so partial failures still
propagate to the caller. Domain validation still aborts up-front (no
point attempting either write with an invalid domain).

Adds a regression test that simulates a Secrets API outage and confirms
the Certificate import is still attempted and the method returns False.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous flow wrapped both deletion surfaces in a single try/except,
so a Secrets-API outage would skip the Certificate-object deletion and
leave the cert orphaned in the vault (and vice versa). store_certificate
already runs each surface under its own try/except after the previous
commit; delete now matches.

Additional fix: _delete_secrets returns a bool to signal partial per-file
failures, but delete_certificate was discarding it and always logging
"successfully deleted". The bool now propagates so the overall result is
False whenever any surface (or any per-file delete inside the secrets
walk) reports a failure.

Tests added:
  - test_both_mode_certificate_exception_does_not_skip_secrets_delete
    (symmetric to the store regression test)
  - test_both_mode_double_failure_returns_false
  - test_both_mode_partial_secret_failure_returns_false (validates the
    bool propagation from _delete_secrets)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous regression test only exercised "Secrets fails, Certificate
succeeds". The 'both' surface-independence contract has three failure
shapes worth pinning:

  - Cert fails, Secrets succeeds (mirror of the existing test)
  - Both surfaces fail (each is still attempted, return False)

These are easy to add now that the store path has its surface-isolated
try/except blocks. Closes the asymmetric coverage gap left after the
49b3bec fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… ambiguous flag

certbot-dns-azure exposes multiple --dns-azure-* options
(--dns-azure-credentials, --dns-azure-config,
--dns-azure-propagation-seconds). The default
DNSProviderStrategy.configure_certbot_arguments appended --{plugin_name},
which certbot's argparse rejected as ambiguous:

  certbot: error: ambiguous option: --dns-azure could match
  --dns-azure-propagation-seconds, --dns-azure-config,
  --dns-azure-credentials

This is the same class of bug that 4ea7269 fixed for DuckDNS via a
per-strategy override. Generalising the fix in the base method makes
every plugin that lacks an override (Azure, Cloudflare, Google,
DigitalOcean, Linode, Gandi, OVH, Namecheap, ArvanCloud, Infomaniak,
EdgeDNS, Hetzner, …) immune to the same trap going forward, since
``--authenticator <name>`` is the canonical, prefix-collision-free
selector documented by certbot.

The existing per-provider overrides (DuckDNS, AcmeDNS, PowerDNS) keep
working — they already use --authenticator — and are left untouched in
this commit to keep the diff minimal.

Tests:
  - test_azure_uses_authenticator_not_shorthand pins the exact failure
    mode reported in fabriziosalmi#113.
  - test_authenticator_selector_used[Cloudflare/Google] guards the base
    method against regression on providers without an override.
  - test_domain_alias_logs_cname_hint confirms the CNAME-delegation log
    line still fires after the selector switch (the original report's
    setup uses CNAME delegation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
requirements-azure.txt pinned certbot-dns-azure==2.11.0, a version that
was never published to PyPI. ``pip install -r requirements-azure.txt``
failed with "No matching distribution found", forcing operators to drop
the pin or downgrade by hand (the issue reporter ended up on 2.2.0).

The latest version actually published on PyPI is 2.6.1, which is the
one already documented in docs/installation.md. Aligning the
requirements file to that version restores ``pip install`` and matches
the documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure validator + module-level constants (VALID_KEY_TYPES,
VALID_RSA_KEY_SIZES, VALID_ELLIPTIC_CURVES) used by the upcoming
configurable-key-type/size feature. The validator accepts
``(None, None, None)`` to mean "caller did not pick anything yet" so
subsequent layers can hand it untouched API payloads.

RSA sizes capped at 2048/3072/4096 (1024 is insecure, 8192 buys nothing
in practice and slows handshakes). ECDSA curves limited to secp256r1
and secp384r1 — secp521r1 is excluded because Let's Encrypt rejects it
and most consumers (browsers, load balancers) only implement the first
two.
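A hedged sketch of the validator contract described above (the real validate_key_options may differ in return shape; the constants match the ones named in the commit):

```python
VALID_KEY_TYPES = {"rsa", "ecdsa"}
VALID_RSA_KEY_SIZES = {2048, 3072, 4096}
VALID_ELLIPTIC_CURVES = {"secp256r1", "secp384r1"}

def validate_key_options(key_type, key_size, elliptic_curve):
    # (None, None, None) means "caller did not pick anything yet".
    if key_type is None and key_size is None and elliptic_curve is None:
        return True, None
    if key_type == "rsa":
        if elliptic_curve is not None:
            return False, "rsa does not take a curve"
        if key_size not in VALID_RSA_KEY_SIZES:
            return False, "invalid rsa key size"
        return True, None
    if key_type == "ecdsa":
        if key_size is not None:
            return False, "ecdsa does not take a key size"
        if elliptic_curve not in VALID_ELLIPTIC_CURVES:
            return False, "invalid curve"
        return True, None
    return False, "unknown key type"  # also catches size-without-type
```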

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new top-level keys in the settings template: default_key_type
(rsa|ecdsa), default_key_size (2048|3072|4096) and
default_elliptic_curve (secp256r1|secp384r1). They control the public-
key shape applied to any cert that does not pick a per-domain override.

The migration path lists them in essential_keys so an upgraded install
silently picks up rsa/2048 — the implicit certbot default that CertMate
emitted before this setting existed. No behaviour change unless the
operator opts in.

save_settings now validates the triple via validate_key_options before
persisting. The check is run only on the active branch (RSA → key_size,
ECDSA → elliptic_curve) so toggling one to the other in the UI does not
require both to be wiped on every switch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
build_certbot_command now accepts key_type, key_size and elliptic_curve
kwargs (all default None for backwards compat) and appends the
corresponding certbot flags right after the SAN -d flags. When a caller
omits all three the command is byte-identical to the pre-feature output
so existing call sites keep working unchanged.
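The flag-appending behaviour can be sketched with a simplified signature (the flag names --key-type, --rsa-key-size and --elliptic-curve are the ones pinned by the tests; the helper name here is illustrative):

```python
# Append certbot key-shape flags; with all kwargs omitted the command is
# returned unchanged, preserving byte-identical backwards compatibility.
def append_key_flags(cmd, key_type=None, key_size=None, elliptic_curve=None):
    cmd = list(cmd)
    if key_type:
        cmd += ["--key-type", key_type]
        if key_type == "rsa" and key_size:
            cmd += ["--rsa-key-size", str(key_size)]
        elif key_type == "ecdsa" and elliptic_curve:
            cmd += ["--elliptic-curve", elliptic_curve]
    return cmd
```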

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
create_certificate accepts the three key kwargs and propagates them to
build_certbot_command. When all three are None it falls back to the
global defaults from settings — this lets legacy callers (web routes,
scripts, scheduled renewal jobs that don't carry per-domain state)
inherit the configured default for free without having to fetch it.

The fallback path that runs when ca_manager isn't available also
appends the key flags manually so a stale ca_manager can't silently
downgrade certs to certbot's RSA-2048 default.

Validation runs once inside create_certificate too, in addition to the
API-layer validation, so renew_certificate (which can pass values read
from disk) never feeds a contradictory shape into certbot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
create_cert_model gains key_type / key_size / elliptic_curve as
optional fields with proper enum constraints, and settings_model gains
the matching default_key_* triple so swagger reflects what the backend
accepts.

CreateCertificate.post extracts the three values, validates them
up-front (returning 400 with the field-specific reason on failure
instead of letting certbot fail later with a stack trace), passes them
to create_certificate, and persists the operator's explicit overrides
into the domain entry so the UI can roundtrip them.

Inheritors (i.e. certs that did not pick anything per-cert) intentionally
do NOT have their effective shape persisted in the entry — that lets a
later change to the global default apply to those certs while leaving
explicit overrides untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A new sub-block inside the existing "Advanced Options" panel of the
Create Certificate form lets the operator pick a per-cert shape. The
Key Type selector defaults to "Use global default" so the form sends
no key fields by default and the backend inherits the configured
default; picking RSA reveals the size selector, picking ECDSA reveals
the curve selector. The two extra selectors are wired to a single
toggleCertKeyOptions() helper that mirrors the show/hide pattern of
the existing storage-backend selector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Settings → General now exposes the three default_key_* selectors under
Certificate Management Settings. The size and curve pickers are mutually
exclusive (toggleDefaultKeyOptions hides whichever is inactive) so the
form never posts a contradictory pair. Load and save flows wire the
selectors into the existing populate/collect helpers and the values
roundtrip cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
20 unit tests in tests/test_key_options.py:

  - validate_key_options accept matrix (3 RSA sizes + 2 ECDSA curves +
    the all-None "use defaults" case)
  - validate_key_options rejection matrix (unknown type, RSA with bad
    size, RSA missing size, RSA with curve, ECDSA with size, ECDSA
    missing curve, ECDSA with unsupported curve, partial shape with
    type=None)
  - build_certbot_command emits --key-type / --rsa-key-size for RSA
  - build_certbot_command emits --key-type / --elliptic-curve for ECDSA
  - build_certbot_command with no key kwargs emits no key flags (the
    backwards-compat regression guard)
  - build_certbot_command with key_type='rsa' but key_size=None does
    not emit a half-flag that would crash certbot
  - SettingsManager.load_settings backfills rsa/2048/secp256r1 on a
    legacy settings.json that has no default_key_* keys
  - SettingsManager.save_settings rejects an inconsistent triple
    (key_type=rsa with key_size=1024)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
architecture.md gets the three new settings keys in the Configuration
Structure example plus a small table explaining when each applies and
how per-domain overrides interact with the globals. RELEASE_NOTES.md
gets an Unreleased entry covering the operator-facing surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
certbot keeps the per-cert config (key shape, DNS plugin, CA, etc.) in
``certificates/{domain}/renewal/<domain>.conf`` and reads it back during
``certbot renew --cert-name``. That works fine when the certificates/
directory lives on a persistent volume, but in Docker/K8s deployments
where the local filesystem is ephemeral and the cert PEMs are restored
from a remote storage backend on pod restart, the renewal config is
lost. The renewal then either fails outright (cert not found) or
silently downgrades the cert to certbot's RSA-2048 default — losing
the shape the operator configured.

Persist the cert-config inputs (key_type / key_size / elliptic_curve /
ca_provider / domain_alias) into metadata.json at create time so a
later ``renew_certificate`` can reconstruct the original config from
the storage backend even when the renewal/<domain>.conf is gone.
``store_certificate`` already syncs metadata.json to the configured
backend (AzureKeyVault, AWS, Vault, Infisical, …), so the persisted
shape survives ephemeral filesystems.

This commit only widens the metadata payload; the renewal path that
consumes the new fields ships in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
create_certificate now accepts ``force=False``. When True it skips the
"already exists" guard and appends ``--force-renewal`` to the certbot
command so the existing cert is replaced. This is the entry point that
``renew_certificate`` will use in the next commit when the certbot
renewal/<domain>.conf is missing — typical Docker/K8s ephemeral-
filesystem case where the PEMs were restored from a remote storage
backend but the renewal conf was not.

The flag is wired through all three command-build paths (ca_manager
happy path, ca_manager fallback for older signatures, and the no-
ca_manager branch) so a force renew works identically regardless of
how the cert was originally created.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
renew_certificate now detects a missing
``certificates/{domain}/renewal/<domain>.conf`` and rebuilds the cert
from scratch using the persisted metadata.json instead of letting
``certbot renew`` fail or silently regenerate with the wrong shape.

This is the Docker/K8s ephemeral-filesystem story: containers/pods
without a PVC on certificates/ rehydrate the cert PEMs from the remote
storage backend on startup, but certbot's own renewal conf does not
travel through the storage backend — only the PEMs and metadata.json
do. A naïve ``certbot renew --cert-name`` either errors out ("No
matching certificates found") or, worse, regenerates the cert with
certbot's RSA-2048 default and loses the original key shape.

The new ``_renew_from_metadata`` helper:
  - reads metadata.json (including the key_type / key_size /
    elliptic_curve / SAN list / DNS plugin / CA persisted at create
    time by the previous commit),
  - falls back to the per-domain settings entry if metadata is gone,
  - falls back to the global default key shape as a last resort,
  - delegates to create_certificate(force=True), which already knows
    how to bypass the "already exists" guard and pass --force-renewal.
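The fallback order above reduces to a small resolution chain (a sketch with hypothetical names, assuming each source is a dict carrying the persisted key shape):

```python
# Resolve the key shape for a rebuild: metadata.json first, then the
# per-domain settings entry, then the global default as a last resort.
def resolve_key_shape(metadata, domain_settings, global_default):
    for source in (metadata, domain_settings):
        if source and source.get("key_type"):
            return source
    return global_default
```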

The PVC-backed setup keeps using the cheap ``certbot renew`` fast
path; the rebuild only kicks in when the conf is truly missing.

The non-reentrant per-domain lock had to grow a ``lock_owned`` flag
because the rebuild path hands the lock over to create_certificate
(which acquires it again) — the finally must not double-release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three scenarios exercise the new ephemeral-filesystem renewal path:

  - renewal/<domain>.conf missing AND metadata.json present (typical
    K8s pod restart with remote storage backend): rebuild from
    metadata, key shape and DNS plugin preserved, force=True passed
    through to create_certificate.
  - renewal/<domain>.conf AND metadata.json both missing, but the
    per-domain settings entry still has dns_provider/account/key
    overrides: rebuild from settings, overrides honoured.
  - filesystem fully wiped, no email anywhere: hard-fail with a
    descriptive RuntimeError instead of silently doing the wrong
    thing.

The first test caught a real bug while being written: the rebuild
path released the per-domain lock from inside _renew_from_metadata
while the caller's finally block still expected to own it, causing a
"release unlocked lock" RuntimeError on the happy path. Now the
caller releases the lock before delegating and tracks the handoff
with a flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both RELEASE_NOTES.md (operator-facing) and docs/architecture.md
(reference) now describe the PVC-fast-path vs ephemeral-rebuild
fallback so deployers know what to expect under Docker/K8s without
persistent volumes on certificates/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous renew_certificate fix released the per-domain Lock before
delegating to _renew_from_metadata so that create_certificate could
reacquire it without deadlock. That release+reacquire window was a
race: a concurrent delete_certificate(domain) on another thread could
slip in, blow away cert.pem, and let the rebuild path then resurrect a
cert the operator had just asked to remove.

Switching the per-domain lock to threading.RLock makes it reentrant
on the same thread, so renew_certificate can hold the lock continuously
through _renew_from_metadata → create_certificate. Cross-thread
exclusion is unchanged (RLock blocks other threads exactly like Lock
does), so a concurrent delete still has to wait for the renew to
finish — closing the race.

The lock_owned flag and the manual release/reacquire dance disappear
with the change.
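The RLock property the fix relies on is easy to demonstrate in isolation (standalone example, not the project's code):

```python
import threading

# The same thread can reacquire an RLock without deadlocking, so the
# renew path can hold the lock continuously while delegating to a
# function that acquires it again. Other threads still block as with
# a plain Lock.
lock = threading.RLock()

def renew():
    with lock:          # held by the renew path
        return create()

def create():
    with lock:          # reacquired on the same thread: no deadlock
        return "renewed"
```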

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The metadata-driven renewal rebuild from the previous commit only fires
when ``cert.pem`` is on disk but ``renewal/<domain>.conf`` is gone. In
the actual Docker/K8s scenario the feature documented (pod on emptyDir
with a remote storage backend as the source of truth) ALL local files
are missing on restart — including cert.pem — so renew_certificate
would still hard-fail with FileNotFoundError before the rebuild path
was even reached.

This commit closes that gap. ``CertificateManager.hydrate_from_storage``
walks ``settings.domains`` at startup, calls ``retrieve_certificate``
on the configured backend for every domain whose cert.pem is missing
locally, and writes the four PEMs plus metadata.json into
``cert_dir/<domain>/``. The certbot ``renewal/<domain>.conf`` is
intentionally NOT recreated — the storage backend doesn't carry it,
and the metadata-driven rebuild path handles its absence on the first
post-restart renew.

Wired in ``modules/core/factory.py`` after the CertificateManager is
constructed. Failures are logged but never raised: a single unreachable
cert must not block app startup. The private key is chmod-ed to 0600
on restore so the rehydrated copy doesn't widen perms vs the original
certbot-issued file.

Tests cover: happy-path restore (with key shape persisted in metadata),
local-copy-already-present skip, backend-has-no-record marker, partial
failure isolation across domains, storage_manager=None no-op, and the
privkey.pem 0600 chmod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Original RELEASE_NOTES and architecture.md described a two-tier story
(certbot fast path vs metadata-driven rebuild) and claimed the rebuild
"covered Docker without a PVC". A reviewer correctly pointed out that
without a startup-hydration step the rebuild path could not actually
fire on a fresh volume — cert.pem itself is missing, so
``renew_certificate`` would die with FileNotFoundError before the
rebuild even ran. The previous commit added that hydration step; this
commit updates the docs to reflect the real three-layer design:

  1. PVC fast path (certbot renew, conf present)
  2. Startup rehydration (PEMs and metadata.json restored from the
     remote storage backend; renewal conf intentionally not restored)
  3. Metadata-driven rebuild on first post-restart renew

The "always preserves the historical shape" claim is also corrected:
there's a single edge case — fresh volume AND empty storage backend —
where the rebuild has to fall back to whatever per-domain overrides
survive in settings.domains and, last-resort, the current global
default. Documenting that honestly is cheaper than promising more
than the code can deliver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "persist only explicit overrides" rule lived inline inside the
CreateCertificate.post body, which made it hard to pin via tests
without standing up the whole Flask-RESTX stack. Extracted to a pure
``build_domain_entry`` helper in modules/core/utils.py — the endpoint
delegates to it. Same behaviour, now testable.

Adds three test classes a previous reviewer flagged as missing:

  - TestDomainEntryPersistence (T2): inheritor entries have no
    key_* fields; RSA override stores type+size only; ECDSA
    override stores type+curve only.
  - TestEndpointPayloadValidation (T1): six negative cases that the
    endpoint must reject with a 400 (rsa+1024, rsa+curve, ecdsa+size,
    ecdsa+secp521r1, dsa, size-without-type), plus the all-None
    happy-path that must NOT 400.
  - TestCorruptMetadataInRebuildPath (T3): if metadata.json is
    malformed, _renew_from_metadata logs and falls back to the
    settings.domains entry instead of letting json.JSONDecodeError
    propagate. The test loads a corrupt blob and asserts
    create_certificate is still invoked with the per-domain
    settings overrides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Native Certificate-object storage mode for the Azure Key Vault backend
(secrets / certificate / both), plus surface-independence in store and
delete, sanitised error messages and 38 unit tests. PR fabriziosalmi#118 against
upstream is open and waiting on review.
Generalises the --authenticator selector across all DNS plugins so a
plugin with several --<plugin>-* options does not break with 'ambiguous
option' (Azure case in upstream issue fabriziosalmi#113). Pins certbot-dns-azure to
2.6.1 since the previous 2.11.0 pin was a non-existent PyPI version.
PR fabriziosalmi#119 against upstream is open and waiting on review.
imartinezgr and others added 7 commits May 7, 2026 15:42
Configurable cert key type/size (RSA 2048/3072/4096 or ECDSA
secp256r1/secp384r1) with a global default plus per-cert override.
Includes the Docker/K8s-aware renewal flow: startup hydration from the
storage backend, metadata.json carries the original cert config, and
renew_certificate rebuilds via certbot certonly --force-renewal when
the renewal conf is gone. Held back from upstream PR pending local
smoke tests.

# Conflicts:
#	RELEASE_NOTES.md
…bot 2.x compat)

Previous commit pinned to 2.6.1 because that's what docs/installation.md
listed and what PyPI's "latest" pointed at. Caught at install time:
2.6.0 and 2.6.1 require ``certbot>=3.0,<4.0``, but the repo's main
``requirements.txt`` is on ``certbot==2.10.0``. ``pip install -r
requirements-azure.txt`` (or building the Docker image with
EXTRA_REQUIREMENTS=requirements-storage-all.txt) hit a
ResolutionImpossible:

    certbot-dns-cloudflare 2.10.0 depends on certbot>=2.10.0
    certbot-dns-azure 2.6.1 depends on certbot<4.0 and >=3.0

Verified via pypi.org/pypi/certbot-dns-azure/<v>/json that 2.5.0 is
the latest that still declares ``certbot<3.0,>=2.0`` — perfect for the
2.10.0 line. 2.4.0 / 2.3.0 / 2.2.0 (the version the issue reporter
fell back to manually) are all in the same compatibility band; 2.5.0
is the most recent of them.

Note: ``docs/installation.md`` keeps documenting 2.6.1 because that
section describes an alternative install with ``certbot==4.1.1``,
where 2.6.1 IS compatible. Only the default install path (which uses
the pinned ``certbot==2.10.0``) needed adjusting.
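The compatibility reasoning above can be reproduced offline. The constraint strings below are the ones quoted in this commit message (2.6.x from the resolver error, 2.5.0 from the PyPI metadata check); the tiny specifier parser is a deliberate simplification of what pip's resolver does, handling only the ``>=`` and ``<`` clauses that appear here:

```python
# Offline sketch: which certbot-dns-azure release is compatible with the
# repo's certbot==2.10.0 pin, given each release's declared constraint.
def version_tuple(v):
    return tuple(int(p) for p in v.split("."))

def satisfies(pin, spec):
    """Check a pinned version against a comma-joined specifier like '<3.0,>=2.0'."""
    for clause in spec.split(","):
        if clause.startswith(">="):
            if version_tuple(pin) < version_tuple(clause[2:]):
                return False
        elif clause.startswith("<"):
            if version_tuple(pin) >= version_tuple(clause[1:]):
                return False
    return True

# Declared 'certbot' constraints per release, as quoted above:
certbot_constraint = {
    "2.5.0": "<3.0,>=2.0",
    "2.6.0": "<4.0,>=3.0",
    "2.6.1": "<4.0,>=3.0",
}
pinned = "2.10.0"  # the repo's certbot pin
compatible = [v for v, spec in certbot_constraint.items() if satisfies(pinned, spec)]
```

Only 2.5.0 survives the filter, matching the pin chosen here.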

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous merge pinned to 2.6.1, which requires certbot 3.x and
broke pip install when combined with the repo's certbot==2.10.0.
2.5.0 is the latest in the certbot-dns-azure line that still declares
certbot<3.0,>=2.0. PR fabriziosalmi#119 already updated.
The Dockerfile builder stage only ever copied requirements.txt and
requirements-minimal.txt, so building the image with REQUIREMENTS_FILE
pointed at a different file (e.g. requirements-storage-all.txt) failed
with "No such file or directory" — the file simply wasn't in the build
context inside the layer.

Two changes:

  1. Copy every requirements*.txt into the builder. The COPY pattern
     is cached on its own layer, so adding more files there does not
     rebuild the heavy pip-install layer below it.
  2. New build-arg EXTRA_REQUIREMENTS (default empty). When set, a
     second pip install runs after the main one. Lets operators bake
     in the optional remote storage backends (Azure Key Vault, AWS
     Secrets Manager, HashiCorp Vault, Infisical) at image build
     time without having to maintain a hand-merged "all-in-one"
     requirements file.

docker-compose.yml exposes the new arg via ${EXTRA_REQUIREMENTS:-} so
``EXTRA_REQUIREMENTS=requirements-storage-all.txt docker compose
build`` Just Works. The default remains empty, preserving the slim
build for users that do not need the extra backends.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-file EXTRA_REQUIREMENTS forced operators to choose between
storage backends and extra DNS plugins; bundling both into one image
required a hand-merged requirements file. The arg now accepts a
list — quote it on the command line so the shell preserves the
spaces:

    --build-arg EXTRA_REQUIREMENTS="requirements-azure.txt requirements-storage-all.txt"

The RUN step iterates with a shell for-loop and runs ``pip install``
once per file. Layer caching is unaffected — the COPY of all
requirements*.txt above is the only thing that invalidates this layer.
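Putting both commits together, the builder stage would look roughly like this. This is a hypothetical excerpt — the base image, stage name, and exact layout are assumed, not copied from the repo's Dockerfile:

```dockerfile
# Hypothetical builder-stage excerpt matching the description above.
FROM python:3.12-slim AS builder
WORKDIR /app

# 1. Copy every requirements file in one dedicated, cheaply-cached layer.
COPY requirements*.txt ./

# 2. Main install, then optional extras driven by the build-arg. An empty
#    EXTRA_REQUIREMENTS expands to a zero-iteration for-loop, preserving
#    the slim default build.
ARG EXTRA_REQUIREMENTS=""
RUN pip install -r requirements.txt && \
    for f in $EXTRA_REQUIREMENTS; do pip install -r "$f"; done
```

Because only the COPY layer depends on the requirements files, adding a new requirements-*.txt never invalidates the pip-install layer unless its contents change.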

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Azure DNS is one of the four "Major cloud providers" CertMate already
ships out of the box (Cloudflare / Route53 / DigitalOcean / Google).
Splitting it into the optional ``requirements-azure.txt`` produced
runtime errors for users that picked Azure DNS in the UI of a default
build:

    Certificate creation failed: The certbot plugin 'dns-azure' is not
    installed. Install it with: pip install certbot-dns-azure

Pinned to 2.5.0 — the latest of the certbot-dns-azure 2.x line that
still declares ``certbot<3.0,>=2.0`` and is therefore compatible with
the repo's ``certbot==2.10.0`` pin. Versions 2.6.0/2.6.1 jump to
certbot 3.x and would block ``pip install -r requirements.txt``.

``requirements-azure.txt`` is left in place for users who build the
slim image (``REQUIREMENTS_FILE=requirements-minimal.txt``) and want
Azure DNS layered on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous Azure DNS support shipped a credentials INI with keys that
certbot-dns-azure does not recognise — ``dns_azure_subscription_id``,
``dns_azure_resource_group``, ``dns_azure_client_id``,
``dns_azure_client_secret``. The plugin silently ignored them and
aborted at validation time with the misleading

    /etc/letsencrypt/.../azure.ini: No authentication methods have
    been configured for Azure DNS. Either configure a service
    principal, system/user assigned managed identity or configure the
    use of azure cli or workload identity credentials

Inspecting certbot_dns_azure._internal.dns_azure shows the real
schema:

    dns_azure_tenant_id        = ...
    dns_azure_sp_client_id     = ...   (NOT 'client_id')
    dns_azure_sp_client_secret = ...   (NOT 'client_secret')
    dns_azure_zoneN            = <FQDN>:<RESOURCE_GROUP_RESOURCE_ID>

The zone mapping is required — without it the plugin raises "At least
one zone mapping needs to be provided". The mapping is built from the
cert's primary domain (with any leading ``*.`` stripped, since the
wildcard cert lives in the apex zone), pointing at
``/subscriptions/<SUB>/resourceGroups/<RG>``. Azure's plugin matches
the longest FQDN suffix at validation time, so a single mapping at
the apex zone covers every SAN that lives in it.
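The INI shape and wildcard handling described above can be sketched as a small builder. The function name and signature are illustrative (the real code lives in ``create_azure_config``); only the key names and the ``<FQDN>:<RESOURCE_ID>`` mapping format come from the schema above:

```python
# Hypothetical sketch of the credentials INI certbot-dns-azure expects.
def build_azure_ini(tenant_id, client_id, client_secret,
                    domain, subscription_id, resource_group):
    # Wildcard certs live in the apex zone, so strip a leading '*.'.
    zone = domain[2:] if domain.startswith("*.") else domain
    rg_id = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    return (
        f"dns_azure_tenant_id = {tenant_id}\n"
        f"dns_azure_sp_client_id = {client_id}\n"
        f"dns_azure_sp_client_secret = {client_secret}\n"
        f"dns_azure_zone1 = {zone}:{rg_id}\n"
    )
```

A single apex mapping suffices because the plugin matches the longest FQDN suffix at validation time.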

Plumbing changes to make ``domain`` reachable inside
``create_azure_config``:

  - DNSProviderStrategy.create_config_file gains an optional
    ``domain: Optional[str] = None`` argument. The base ABC keeps the
    default so non-Azure strategies do not have to thread the value
    through; bulk-applied to all 17 concrete strategies for signature
    consistency.
  - certificates.py:574 propagates the cert's primary domain when it
    invokes the strategy.
  - create_azure_config raises ValueError when domain is missing —
    surfaces a misuse early rather than letting certbot fail later
    with a vague message.

Tests:
  - test_writes_sp_keys_and_zone_mapping pins the exact INI shape and
    the absence of the legacy keys.
  - test_wildcard_domain_strips_leading_asterisk_in_zone_mapping
    covers the wildcard case.
  - test_missing_domain_raises pins the defensive check.
  - The two existing AzureStrategy tests now pass ``domain=`` so they
    exercise the same code path as production.

Closes the runtime half of issue fabriziosalmi#113 (the package-not-installed half
was already addressed by 4ea7269 + the certbot-dns-azure pin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fabriziosalmi
Copy link
Copy Markdown
Owner

This is a substantive 1700-line feature with a careful test surface (38 unit tests, mocked SDK clients, backwards-compat migration, three modes, CRC-aware list filter fix in passing). I want to give it a proper review rather than a rushed one — the architecture decisions (composition pattern, surface-independent store/delete, the metadata allow-list, sanitised error messages) deserve careful reading.

Realistic timeline: I'll target this as a v2.5.0 candidate, after the smaller in-flight items land (#119 fixing #113, #122 dns alias rebase, #120 *_FILE secrets, #126 download ?file= param). The patch versions in this 2.4.x line are issue-triage only; this PR is feature-shaped and deserves a minor bump.

Quick first-pass comments (not blockers — putting them here so they're not lost):

  1. `storage_mode='both'` semantics: writes go to both surfaces independently (good — Secrets outage doesn't block Certificate write), reads prefer Secrets ("cheaper" — agreed). What happens on `get_certificate(domain)` when Secrets has version N and Certificate-object has version N-1 (e.g. a Certificate-mode write succeeded but the Secrets-mode write later failed in a previous renewal)? Worth confirming the read returns the freshest version that exists, not unconditionally Secrets.
  2. The CRC-aware list_secret_domains regex fix is its own bug worth calling out in the release notes — without it, list_certificates() walked an empty list silently. Pre-existing, but invisible to anyone who hadn't noticed certs disappearing from the list view.
  3. Backfill endpoint restricted to storage_mode='both': clear and correct.

Will do the full pass — checked the test file already exists in the diff and has good coverage of the helper class. Ping me if you want me to start the review now and just defer merging, vs. deferring everything.

APScheduler jobs run in background threads where Flask's thread-local
current_app proxy is unbound, causing RuntimeError when the jobs tried
to access current_app.config.  Fix by:

- keeping a module-level _flask_app reference
- setting app.config['MANAGERS'] so jobs can resolve managers
- extracting a shared _run_manager_job() helper with logging and
  exception handling
- ordering create_app() so _flask_app is ready before the scheduler
  starts (fixes race on recovered persistent jobs)
Synchronize with upstream fabriziosalmi/certmate main:
- Domain alias mode (dns_alias_hook.py, CNAME chain verification)
- v2.4.7 base image bump bookworm → trixie
- SECRET_KEY_FILE / API_BEARER_TOKEN_FILE support
- ?file= param on /api/certificates/<domain>/download
- CertMate.html auto-escape helper + UI refactor
- Security bumps (requests CVE, postcss 8.5.10)
- CI workflow fix for fork-originated PRs

Merge preserves our fork features:
- Configurable certificate key type/size (RSA/ECDSA)
- Azure Key Vault native Certificate-object support
- APScheduler app_context fix (background jobs)
- Hydrate-from-storage on startup
- Metadata-driven renewal rebuild

Resolved conflicts:
- modules/core/certificates.py (domain alias + key type + Azure KV)
- static/js/dashboard.js (alias indicator + CertMate.html)
- tests/test_issue113_azure_dns_ambiguous.py
- RELEASE_NOTES.md (combined unreleased + upstream versions)
- provide localhost fallback email when metadata.json lacks one
  to avoid RuntimeError in _renew_from_metadata during domain-alias
  renewal rebuild (merge artefact from upstream+ours code paths)
- fix _manager test mock to include required email field
Address maintainer feedback from fabriziosalmi#118 review:

1. Stale read detection in 'both' mode: retrieve_certificate now
   compares updated_on timestamps of Secrets and Certificate surfaces,
   returning whichever is freshest. Prevents returning stale data
   when one surface succeeds but the other fails during renewal.

2. New _retrieve_from_secrets returns latest_update timestamp
   alongside cert_files and metadata so callers can compare.

3. New get_certificate_update_time() helper on the importer.

4. Tests: 2 new cases covering Certificate-newer-than-Secrets
   and Secrets-newer-than-Certificate divergence (40 total).

5. Release notes: stale read detection entry + separate CRC-aware
   regex fix entry (per maintainer request).

6. Updated existing 3 tests to mock updated_on with real datetimes.
@rocogamer
Copy link
Copy Markdown
Contributor Author

Thanks for the thoughtful first-pass review! I've addressed all three points you raised:

1. Stale read detection in both mode

You flagged the real risk: get_certificate returning stale Secrets data when the Certificate-object surface is newer. Fixed by making _retrieve_from_secrets return the updated_on timestamp alongside the PEMs/metadata, and adding a get_certificate_update_time() helper on the importer. retrieve_certificate now compares timestamps of both surfaces and returns whichever is freshest:

  • Secrets present, Cert absent → return Secrets (no change, cheaper path)
  • Secrets absent, Cert present → return Cert (fallthrough, was already there)
  • Both present, Cert newer → Cert wins
  • Both present, Secrets newer → Secrets wins (faster path + fresh)

Two new tests cover the skew scenarios in TestRetrieveBothMode (40 total, all green).
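The four cases above reduce to a small freshest-wins rule. This is an illustrative decision table, not the actual `retrieve_certificate` code; the inputs stand in for each surface's `updated_on` value (`None` meaning the surface is absent):

```python
# Which surface should a 'both'-mode read use, given each surface's
# updated_on timestamp (None = surface absent)?
def pick_surface(secrets_updated, cert_updated):
    if cert_updated is None:
        return "secrets" if secrets_updated is not None else None
    if secrets_updated is None:
        return "certificate"
    # Both present: freshest wins; ties go to Secrets (the cheaper path).
    return "secrets" if secrets_updated >= cert_updated else "certificate"
```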

2. CRC-aware list_secret_domains regex fix

Agreed this is its own bug. Added a standalone ### Bug Fixes entry in RELEASE_NOTES.md calling out that endswith('-metadata') silently matched zero secrets in production because _sanitize_secret_name always appends an 8-char CRC32 suffix. The fix anchors the regex to ^cert-.+-metadata-[0-9a-f]{8}$.
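To illustrate why `endswith('-metadata')` matched zero secrets: every sanitised name carries a trailing CRC suffix, which only the anchored regex accounts for. The sanitiser below is a hypothetical stand-in for `_sanitize_secret_name` (the exact input to the CRC is assumed); the regex is the one from the fix:

```python
# Hypothetical sanitiser: appends an 8-char CRC32 suffix, as the real
# _sanitize_secret_name does, so names never end in '-metadata'.
import re
import zlib

def sanitize_secret_name(domain):
    base = "cert-" + domain.replace(".", "-") + "-metadata"
    crc = format(zlib.crc32(domain.encode()) & 0xFFFFFFFF, "08x")
    return f"{base}-{crc}"

# The anchored regex from the fix:
METADATA_RE = re.compile(r"^cert-.+-metadata-[0-9a-f]{8}$")
```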

3. Branch sync

Rebased on top of current main (now includes #119, #122, #120, #126, v2.4.7 trixie bump). Merge is clean — MERGEABLE again.

Diff summary

  • modules/core/storage_backends.py — stale read detection logic, new helper
  • tests/test_azure_keyvault_certificate_storage.py — 2 new tests + fixed 3 existing mocks with updated_on datetimes
  • RELEASE_NOTES.md — stale read entry + CRC fix entry

Ready for your full review at your convenience — no rush, understood this is a v2.5.0 candidate.

@rocogamer
Copy link
Copy Markdown

Contributor Author

Apologies — I just noticed that the git merge main I ran to resolve the merge conflicts pulled in way more than intended. The PR diff is now cluttered with commits that have nothing to do with the Azure Key Vault Certificate-object feature.

Here's what leaked in and what each thing is:

Leaked features (not part of this PR):

| Feature | What it does | Why it's here |
| --- | --- | --- |
| Configurable certificate key type/size | Lets operators choose RSA 2048/3072/4096 or ECDSA P-256/P-384 per cert or as a global default. Touches certificates.py, ca_manager.py, settings.py, utils.py, UI templates, and a 641-line test file (test_key_options.py). | Merged from our fork's feat/configurable-key-type-size branch. |
| APScheduler app_context fix | Fixes a RuntimeError: Working outside of application context in cron-triggered background jobs by pushing app.app_context() explicitly. Touches factory.py. | Small fix we developed today while testing the merged code. |
| Build/deploy tweaks | EXTRA_REQUIREMENTS build-arg, certbot-dns-azure==2.5.0 pinning, etc. | Carried over from our fork's main during the sync. |

What belongs to this PR:

  • modules/core/storage_backends.py — original Azure KV three-mode backend + stale read detection
  • tests/test_azure_keyvault_certificate_storage.py — 40 unit tests
  • The Azure KV entries in RELEASE_NOTES.md, modules/api/models.py, modules/api/resources.py, modules/core/settings.py, docs/architecture.md, templates, and static/js/settings.js

Plan: I'll clean this up — either by doing a proper rebase of just the Azure KV commits on top of upstream/main, or by opening a fresh PR with only the relevant changes. Let me know which you prefer and I'll get it sorted quickly. Really sorry for the noise in the diff.

@fabriziosalmi
Copy link
Copy Markdown
Owner

Full Review: Azure Key Vault Certificate-object Storage

Thanks for the thorough work on this — the architecture is genuinely impressive. The composition pattern, surface-independent writes, stale-read detection (addressing my earlier comment about version skew in both mode), metadata allow-list, and sanitised error messages are all well thought out.

🔴 Blocker: PR is contaminated

As you noted in your latest comment, the git merge main pulled in ~1200 lines of unrelated code:

| Leaked feature | Impact |
| --- | --- |
| Configurable key type/size | 641-line test file + 322 lines in certificates.py + ca_manager.py + utils.py |
| APScheduler app_context fix | 79 lines in factory.py |
| Build/deploy tweaks | Dockerfile, docker-compose.yml, requirements.txt |

Please open a fresh PR with only the Azure KV commits (cherry-pick or interactive rebase). The 3413-line diff is not reviewable in its current state — I need to see only the ~2200 lines that belong to this feature.

✅ Architecture positives (confirmed on code review)

  • Composition: _AzureKeyVaultCertificateImporter encapsulates Certificate-API specifics — clean separation
  • Surface independence: store_certificate() and delete_certificate() wrap each surface in independent try/except — a Secrets outage doesn't block the Certificate write
  • Stale-read detection: retrieve_certificate() in both mode compares updated_on timestamps and returns the freshest — exactly what I flagged in my first comment ✅
  • issuer_name="Unknown": prevents Key Vault from attempting auto-renewal (CertMate stays source of truth)
  • CRC fix: the list_secret_domains regex anchoring is a real bug fix — endswith('-metadata') never matched anything in production
  • Test coverage: 974 lines, 40 unit tests, comprehensive mock coverage

🟡 Non-blocking suggestions for the clean PR

  1. Document that Secret API exposes the private key — when importing a Certificate object, Azure creates a companion Secret containing the full PFX (including private key). Anyone with Secrets/Get can extract it. This is by-design in Azure but operators should be aware.
  2. Backfill rate limiting — the backfill endpoint iterates all domains synchronously. For vaults with hundreds of certs, this may trigger Azure throttling (HTTP 429). Consider a batch_size parameter or background job.
  3. Pin cryptography>=41.0.0: load_pem_x509_certificates (plural) was added in v41. Older versions will fail with a confusing ImportError.

Timeline

This is a v2.5.0 candidate — once the PR is rebased clean, I'll do a final pass and merge. No rush, take your time getting the rebase right.

@rocogamer
Copy link
Copy Markdown
Contributor Author

@fabriziosalmi Opened the clean rebase as #139 — same 12 Azure-KV commits cherry-picked onto current upstream/main, none of the leaked feature/build commits. Diff is 1878/126 across 11 files (only the files that belong to this feature).

I also addressed all three non-blocker suggestions from your review in #139:

  1. Secrets/Get private-key exposure — pulled out of the permissions-table cell into a prominent Security note in docs/architecture.md under the storage-modes section.
  2. Backfill rate limiting — added a ?limit=N query parameter on the endpoint; response now includes a remaining count so operators with large vaults can paginate. Pagination is safe because each call is idempotent (already-imported domains are reported as skipped).
  3. cryptography>=41.0.0 floor — added explicitly to requirements-azure-storage.txt and requirements-storage-all.txt. Main requirements.txt already pins cryptography==46.0.7, but the extras were silently relying on the transitive constraint; making the floor explicit means a layered install fails loudly with a pip resolver error rather than an opaque runtime ImportError.
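An operator-side driver for the paginated backfill in point 2 might look like this. The `imported`/`skipped`/`remaining` response fields are assumptions based on the description above (the PR mentions a `remaining` count and skipped domains); `post_backfill` stands in for an authenticated POST to the backfill endpoint:

```python
# Illustrative pagination loop for the ?limit=N backfill endpoint.
def backfill_all(post_backfill, limit=50):
    """Call the endpoint repeatedly until `remaining` hits zero.

    Safe to re-run because each call is idempotent: already-imported
    domains are reported as skipped, never re-imported."""
    imported, skipped = 0, 0
    while True:
        resp = post_backfill(limit=limit)
        imported += resp["imported"]
        skipped += resp["skipped"]
        if resp["remaining"] == 0:
            return imported, skipped
```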

Tests: 40/40 pass.

Closing this one in favour of #139.
