feat: Azure Key Vault native Certificate-object storage mode (clean rebase of #118)#139
feat: Azure Key Vault native Certificate-object storage mode (clean rebase of #118)#139rocogamer wants to merge 14 commits into
Conversation
Adds a configurable storage_mode to the Azure Key Vault backend so it can
persist certificates as native Certificate objects (PKCS12, issuer
'Unknown') in addition to — or instead of — the existing per-PEM Secrets
layout. Native Certificate objects are consumed directly by App Service,
Application Gateway, Front Door, API Management and AKS Ingress without
manual PFX export.
Modes (default 'secrets' preserves existing behaviour):
secrets — current per-PEM Secrets layout, unchanged
certificate — single Certificate object, metadata stored as tags
both — write to both surfaces, prefer Secrets on read
The new POST /api/storage/azure-keyvault/backfill-certificates admin
endpoint imports Certificate objects for domains already stored as
Secrets (skipping ones already imported); restricted to storage_mode=
'both' so the legacy Secrets are still listed during the walk.
Implementation notes:
- Composition: AzureKeyVaultBackend keeps auth/naming/retry; the new
private _AzureKeyVaultCertificateImporter encapsulates Certificate-
API specifics. Both clients are lazy.
- Tags ↔ metadata round-trip lives in a single class-level helper with
a strict allow-list, so vault-level tags from Azure Policy do not
pollute the rehydrated metadata.
- Truncated SAN lists drop the (incomplete-by-construction) trailing
fragment on rehydrate to avoid exposing a malformed FQDN.
- The metadata-secret list filter is anchored to the CRC32 suffix that
_sanitize_secret_name appends, fixing a pre-existing bug in main
where endswith('-metadata') matched zero secrets in production.
- Settings migration backfills storage_mode='secrets' for upgraded
installs.
- StorageBackendTest probes the Certificate API explicitly when the
target mode requires it; failure messages expose only the exception
type to the client (full detail in server logs).
Tests: 32 unit tests with mocked SDK clients cover all three modes,
backfill flow pre-conditions, retrieve fallback paths in 'both' mode,
SAN truncation edge cases, settings migration and external-tag
filtering. Suite of 58 unit tests passes with no regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The docstring previously said "rehydrated from the Certificate object's tags". The actual code reads secret.properties.tags from the companion Secret — Azure mirrors them, so the behaviour matches the description in practice, but the docstring promised something the code doesn't literally do. Spelling out the mirroring (and the deliberate read-from-Certificate in the 'both' fallback) makes the contract auditable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The backfill endpoint was returning f'error: {per_domain}' to the client,
which exposes the full Azure exception text (tenant id, request id,
WWW-Authenticate headers). The companion StorageBackendTest endpoint
already only returns the exception type. Match the pattern here so an
admin-only surface still doesn't leak SDK internals into the response
body — full detail stays in the server log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The backend rejects backfill unless storage_mode='both' (in 'certificate' mode list_certificates() does not surface the legacy Secrets so the walk would silently no-op). Match the UI to that contract instead of letting the user click into a 400 — the original heuristic toggled visibility when mode != 'secrets', which falsely advertised the action in certificate-only mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The sanitiser never touched self — it's a pure transformation of its name argument. Marking it @staticmethod lets callers (including the importer's ``sanitize_name`` callback wiring and the test helper) reach it without a backend instance, removing the AzureKeyVaultBackend.__new__(...) trick the round-trip test was using to call it during fixture setup. The bound-method-as-callback wiring at _get_cert_importer() keeps working because Python returns the underlying function for staticmethods accessed via an instance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In 'both' mode the Secrets and Certificate surfaces are independent and a write to one should not block the other. The previous flow caught a Secrets-surface exception with the outer try/except, which short- circuited the method before the Certificate-object branch ran. A return- False from the same call (no exception) by contrast did fall through and import the cert — asymmetric, surprising, and worse during a real outage of the Secrets API since it would silently fail to update App Gateway-bound certs that consume the Certificate object. Each surface now runs under its own try/except. The method returns True only when every active surface succeeded, so partial failures still propagate to the caller. Domain validation still aborts up-front (no point attempting either write with an invalid domain). Adds a regression test that simulates a Secrets API outage and confirms the Certificate import is still attempted and the method returns False. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous flow wrapped both deletion surfaces in a single try/except,
so a Secrets-API outage would skip the Certificate-object deletion and
leave the cert orphaned in the vault (and vice versa). store_certificate
already runs each surface under its own try/except after the previous
commit; delete now matches.
Additional fix: _delete_secrets returns a bool to signal partial per-file
failures, but delete_certificate was discarding it and always logging
"successfully deleted". The bool now propagates so the overall result is
False whenever any surface (or any per-file delete inside the secrets
walk) reports a failure.
Tests added:
- test_both_mode_certificate_exception_does_not_skip_secrets_delete
(symmetric to the store regression test)
- test_both_mode_double_failure_returns_false
- test_both_mode_partial_secret_failure_returns_false (validates the
bool propagation from _delete_secrets)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous regression test only exercised "Secrets fails, Certificate succeeds". The 'both' surface-independence contract has three failure shapes worth pinning: - Cert fails, Secrets succeeds (mirror of the existing test) - Both surfaces fail (each is still attempted, return False) These are easy to add now that the store path has its surface-isolated try/except blocks. Closes the asymmetric coverage gap left after the 49b3bec fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address maintainer feedback from fabriziosalmi#118 review: 1. Stale read detection in 'both' mode: retrieve_certificate now compares updated_on timestamps of Secrets and Certificate surfaces, returning whichever is freshest. Prevents returning stale data when one surface succeeds but the other fails during renewal. 2. New _retrieve_from_secrets returns latest_update timestamp alongside cert_files and metadata so callers can compare. 3. New get_certificate_update_time() helper on the importer. 4. Tests: 2 new cases covering Certificate-newer-than-Secrets and Secrets-newer-than-Certificate divergence (40 total). 5. Release notes: stale read detection entry + separate CRC-aware regex fix entry (per maintainer request). 6. Updated existing 3 tests to mock updated_on with real datetimes.
When a Certificate object is imported, Key Vault auto-creates a Secret with the same name whose value is the full PFX (including the private key). The pre-existing one-liner inside the Service Principal table was easy to miss; pull it out as a dedicated **Security note** under the storage-modes section so operators sizing RBAC scopes for the new ``certificate``/``both`` modes actually see it. No code change; this only edits ``docs/architecture.md``.
Vaults with hundreds of certificates can trip Azure Key Vault's transaction throttling when the synchronous backfill loop iterates every domain in one shot (worst case: 4-secret read + import per legacy cert). Add an optional ``?limit=N`` query parameter that caps how many domains are processed in a single call, and surface a ``remaining`` count in the response so operators know to call again. Each invocation is idempotent — already-imported domains are reported as ``skipped`` — so pagination is safe. The body of the loop is unchanged; only its termination and the response schema are touched.
The Certificate-object code path in ``modules/core/storage_backends.py`` uses ``cryptography.x509.load_pem_x509_certificates`` (plural form) and the PKCS12 builder. The main ``requirements.txt`` already pins ``cryptography==46.0.7`` (pulled in by certbot), so a normal install is unaffected, but the optional ``requirements-azure-storage.txt`` / ``requirements-storage-all.txt`` extras were silently relying on that transitive constraint. Make the floor explicit so a layered install onto a stripped-down base image fails loudly with a clear pip resolver error rather than an opaque runtime ImportError.
|
Heads-up: v2.4.12 (PR #147) landed a few minutes ago and touched Could you rebase onto current
Everything else (Azure-KV-specific code, the 12 commits you cherry-picked, the suggestions-addressed pass on #118) is untouched by v2.4.12, so the rebase shouldn't surface anything substantive. Thanks again for the patient cleanup work going from #118 to this — it's a substantially easier review with the unrelated 1200 lines stripped out. |
…ificate-objects # Conflicts: # RELEASE_NOTES.md
162cc9a to
e950bb3
Compare
|
Hi @fabriziosalmi, Merge conflicts resolved — I merged the latest All 40 Azure Key Vault certificate-object tests pass, plus the upstream sprint security audit suites (Sprint 1 + 1.5). Zero test regressions. Thanks again for your patience and for the super clear pointers in your review comment — they made the conflict resolution straightforward. No trouble at all, and really appreciate you reviewing this! |
|
Heads-up on the rebase surface: since the earlier rebase ask, v2.4.13 through v2.4.17 have landed on No urgency — the PR stays open until you have time. When you do rebase, the Sprint 1.5 / 1.7 changes most likely to matter for this one are:
Thanks again for the patient cleanup work from #118 → here. |
…abriziosalmi#139) Merge the four upstream releases into the Azure Key Vault Certificate-object branch: - v2.4.17 – bare-list /api/deploy/history (fabriziosalmi#153) + EXTRA_REQUIREMENTS + certbot-dns-azure baked in (fabriziosalmi#155) - v2.4.16 – DiagnosticsSnapshot endpoint + one-click bug-report UI (fabriziosalmi#157) - v2.4.15 – Sprint 1.6 audit polish (ReDoc self-host, settings GET role, per-username rate limit) - v2.4.13 – Sprint 1.5 audit follow-up (auth refactor, download role split, audit wiring) Conflict resolution: - modules/api/resources.py: preserved our StorageAzureKeyVaultBackfill resource + verify_certificate_api_access() probe; incorporated new DiagnosticsSnapshot class and added to return dict. - modules/core/factory.py: adopted upstream ns_diagnostics + DATA_DIR wiring. - modules/web/settings_routes.py: accepted bare-list return contract. - RELEASE_NOTES.md: combined upstream v2.4.13-2.4.17 notes with our Unreleased Azure KV section. All existing tests pass (40 Azure KV + 11 Diagnostics + 4 deploy history).
b6f27c9 to
7c9a41c
Compare
|
Rebased cleanly on top of upstream/main (v2.4.17). All four sprint patches integrated with zero structural conflicts:
What was manually stitched:
Test results on the merge commit (7c9a41c):
tests/test_azure_keyvault_certificate_storage.py::TestBuildPfx::test_round_trip PASSED [ 2%] ============================== 40 passed in 0.56s ============================== → 40 passed
tests/test_diagnostics_snapshot.py::TestSnapshotShape::test_basic_shape PASSED [ 9%] ============================== 11 passed in 0.68s ============================== → 11 passed
tests/test_issue152_deploy_history.py::test_deploy_history_returns_bare_array PASSED [ 25%] ============================== 4 passed in 0.16s =============================== → 4 passed Should be good to go whenever you have review bandwidth. |
|
Subscribing as an interested party in seeing this functionality question from me - what is the benefit of using the my personal use case is we only use certificate objects for certs, so am curious to understand if there will be any benefit to using both mode |
|
Queued for review + merge tonight, in the same window as the broader v3 UI fix pass we have batched. Plan is to run the smoke test plan from the PR body against a real Azure Key Vault (Service Principal with Certificates + Secrets permissions, all three storage modes exercised end-to-end including the Thanks for the clean rebase off the original #118 and for addressing the three non-blocker notes from that review; the diff is exactly the surface area I want to review now. |
Replaces #118 — same feature, rebased clean on top of current
mainso the diff only contains Azure-KV-related changes. The original PR'sgit merge mainhad pulled in ~1200 lines of unrelated work from our fork; this PR contains only the 12 Azure-KV commits cherry-picked onto currentupstream/main.Also addresses all three non-blocker suggestions from @fabriziosalmi's full review on #118.
Summary
Adds a configurable
storage_modeto the Azure Key Vault storage backend so it can persist certificates as nativeCertificateobjects (PKCS12,issuer_name="Unknown") in addition to — or instead of — the existing per-PEM Secrets layout. Native Certificate objects are consumed directly by App Service, Application Gateway, Front Door, API Management and AKS Ingress without manual PFX export.storage_mode: secrets(default) — current per-PEM Secrets layout, fully backwards-compatible.storage_mode: certificate— single Certificate object, metadata stored as tags.storage_mode: both— write to both surfaces; reads compareupdated_onand return the freshest (stale-read detection).A new admin endpoint
POST /api/storage/azure-keyvault/backfill-certificatesimports Certificate objects for domains already stored as Secrets; restricted tostorage_mode='both'so the legacy Secrets are still listed during the walk. Accepts?limit=Nfor paginating large vaults.Changes since #118 review
Suggestions addressed
docs/architecture.md): pulled the by-design Azure behaviour (importing a Certificate object auto-creates a Secret containing the full PFX) out of the permissions-table cell into a prominent Security note under the storage-modes section, so operators sizing RBAC scopes forcertificate/bothmodes actually see it.?limit=Nquery parameter on the backfill endpoint (modules/api/resources.py): caps how many domains are processed per call to avoid Azure Key Vault transaction throttling on vaults with hundreds of legacy certs. Response now includes aremainingcount so operators can paginate by calling repeatedly; each call is idempotent because already-imported domains are reported asskipped.cryptography>=41.0.0floor in storage extras (requirements-azure-storage.txt,requirements-storage-all.txt): the Certificate-object code path usescryptography.x509.load_pem_x509_certificates(plural) and the PKCS12 builder. Mainrequirements.txtalready pinscryptography==46.0.7via certbot, but the optional extras were silently relying on that transitive constraint. Making the floor explicit makes a layered install fail loudly with a pip resolver error instead of an opaque runtimeImportError.Commits in this PR (12)
Diff stat (was 3413 / 202 across 30+ files on #118)
Test plan
pytest tests/test_azure_keyvault_certificate_storage.py -v— 40/40 pass (2 skip cleanly whenazure-keyvault-certificatesis not installed).storage_mode, confirm it migrates tosecretssilently.storage_mode=both; press Test connection — both Secrets and Certificate API access verified.cert-…-{cert,chain,fullchain,privkey}-pem-{crc}AND acert-{domain}-{crc}Certificate object (with domain/dns_provider/staging tags) both appear.storage_mode=certificate; issue a new cert — only the Certificate object is created, no Secrets.GET /api/certificates/{domain}— returns the four PEMs reconstructed from the PFX export.DELETE /api/certificates/{domain}— both surfaces removed inbothmode.bothwith legacy Secrets-only domains, press Backfill Certificate objects — every legacy domain gets an imported Certificate object; already-imported ones are reported asskipped. Pass?limit=10on a large vault — first 10 get processed, response reportsremaining: N-10.Notes for reviewers
storage_mode='secrets'means existing installs see no behaviour change after upgrade.azure-keyvault-certificatesSDK is in the optional storage extras (requirements-azure-storage.txt); CI does not install it by default, so the two SDK-touching tests skip cleanly underpytest.importorskip.🤖 Generated with Claude Code