From 66ed1154811abd379bc072e5d3a4820ca2887a6c Mon Sep 17 00:00:00 2001
From: jagilber
Date: Tue, 24 Mar 2026 13:35:27 -0400
Subject: [PATCH 1/4] Add Linux cert rotation TSG with complete ARM/PS commands and troubleshooting

- Add full ARM JSON snippets for all 6 rotation steps (VMSS secrets, extension settings, SF cluster resource, swap, remove)
- Add PowerShell commands (Add/Remove-AzServiceFabricClusterCertificate, New-AzResourceGroupDeployment)
- Fix cert format docs: clarify .crt/.prv (waagent) vs .crt/.key (KV extension) vs .pem
- Fix ClusterManifest locations: document both TempClusterManifest.xml and ClusterManifest.current.xml
- Fix recovery Option 2: replace incorrect systemctl stop/start servicefabric with pkill sfbootstrapagent/FabricHost + walinuxagent restart
- Fix recovery Option 1: add prereqs note about az CLI availability, add curl+managed identity alternative
- Add InfrastructureManifest.xml and Settings.xml updates to manual recovery
- Add Key Vault VM extension docs for common-name cert auto-rollover
- Add Managing Azure Resources reference link
---
 ...ertificate Rotation and Troubleshooting.md | 663 ++++++++++++++++++
 1 file changed, 663 insertions(+)
 create mode 100644 Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md

diff --git a/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md b/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md
new file mode 100644
index 00000000..1508da81
--- /dev/null
+++ b/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md
@@ -0,0 +1,663 @@
+# Service Fabric Linux Cluster Certificate Rotation and Troubleshooting
+
+This article provides Linux-specific steps for certificate rotation and troubleshooting on Azure Service Fabric Linux clusters. For Windows clusters, see [Fix Expired Cluster Certificate Manual Steps](./Fix%20Expired%20Cluster%20Certificate%20Manual%20Steps.md) or [How to add and swap the Secondary Certificate](./Use%20Azure%20Resource%20Explorer%20to%20add%20the%20Secondary%20Certificate.md).
+
+> [!NOTE]
+> The ARM/SFRP-level operations (adding secondary cert, swapping primary/secondary, updating VMSS secrets) are the same for Linux and Windows. The differences are in **on-node certificate format, location, delivery mechanism,** and **troubleshooting**.
+
+## [Applies To]
+
+Azure Service Fabric **Linux** clusters (Ubuntu, Red Hat) secured with X.509 certificates declared by thumbprint.
+
+## [Key Differences: Linux vs Windows]
+
+| Aspect | Windows | Linux |
+|--------|---------|-------|
+| **Cert format** | PFX (PKCS#12) in Windows cert store | `.crt` + `.prv` (PEM) files or `.pem` files |
+| **Cert location** | `LocalMachine\My` cert store | `/var/lib/waagent/` (waagent-delivered) and `/var/lib/sfcerts/` (SF runtime) |
+| **Cert delivery** | CRP/VM Agent writes PFX to cert store | waagent downloads `.crt`/`.prv` files to `/var/lib/waagent/` |
+| **Old cert removal** | Old certs remain in store | waagent **removes** old certs on goal state change (incarnation update) |
+| **ACL/permissions** | `NETWORK SERVICE` ACL on private key | `sfuser` file ownership/permissions on `.crt`/`.prv` files |
+| **Verification tool** | RDP + `certlm.msc` | SSH + `ls` / `openssl` |
+| **Event logs** | `Microsoft-ServiceFabric%4Admin.evtx` | `/var/log/syslog` |
+| **Agent logs** | Windows Event Log + bootstrap agent | `/var/log/waagent.log` + bootstrap agent logs in extension directory |
+| **Extension type** | `ServiceFabricNode` | `ServiceFabricLinuxNode` |
+| **Extension location** | `C:\Packages\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode` | `/var/lib/waagent/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode-{version}` |
+
+## [Certificate File Locations on Linux]
+
+### Waagent-Delivered Certificates (VMSS Secrets)
+
+When certs are deployed via VMSS `osProfile/secrets` (Key Vault), the Azure Linux Agent (waagent) downloads them to:
+
+```text
+/var/lib/waagent/{THUMBPRINT_UPPERCASE}.crt # Public certificate (PEM format)
+/var/lib/waagent/{THUMBPRINT_UPPERCASE}.prv # Private key (PEM format)
+```
+
+Example:
+
+```text
+/var/lib/waagent/ED815E6241146A1730D6C81F06BD1B5692CC0942.crt
+/var/lib/waagent/ED815E6241146A1730D6C81F06BD1B5692CC0942.prv
+```
+
+> [!IMPORTANT]
+> When delivered via VMSS `osProfile/secrets`, waagent uses `.prv` extension for private keys (not `.key`). When using the Key Vault VM extension or manual placement, certificates follow the standard `.crt`/`.key` or single `.pem` format as described in [MS Learn](https://learn.microsoft.com/azure/service-fabric/service-fabric-configure-certificates-linux). The SF bootstrap agent copies/links certs from `/var/lib/waagent/` into `/var/lib/sfcerts/` for the SF runtime.
+
+### Service Fabric Runtime Certificates
+
+The SF runtime expects certificates in:
+
+```text
+/var/lib/sfcerts/ # Maps to LocalMachine\My on Windows
+```
+
+All files must be in PEM format. Service Fabric expects either a `.pem` file containing both certificate and private key, or a `.crt` file with the certificate and a `.key` file with the private key (per [MS Learn](https://learn.microsoft.com/azure/service-fabric/service-fabric-configure-certificates-linux#location-and-format-of-x509-certificates-on-linux-nodes)).
+
+### Key Vault VM Extension Certificates
+
+If the Key Vault VM extension is installed (recommended for common-name cert declarations with auto-rollover), certs go to:
+
+```text
+/var/lib/waagent/Microsoft.Azure.KeyVault.Store/ # KV extension managed certs
+```
+
+The Key Vault VM extension delivers certs in `.pem` + `.key` format and supports version-less URIs for automatic certificate renewal. See [Manage certificates in Service Fabric clusters](https://learn.microsoft.com/azure/service-fabric/cluster-security-certificate-management) for the recommended auto-rollover pattern.
+
+### Extension Configuration and Logs
+
+```text
+/var/lib/waagent/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode-{version}/
+├── config/
+│   ├── {N}.settings          # Extension settings (contains cert thumbprints)
+│   └── {N}.status            # Extension status
+├── heartbeat.log             # Node health heartbeat
+├── status/                   # Status files
+└── ServiceFabricLinuxExtension_install.log
+```
+
+### Cluster Manifest
+
+There are two locations with manifest files:
+
+```text
+# Extension-staged manifest (temporary, used during bootstrap)
+/var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml
+
+# Runtime manifest (authoritative, used by running SF node)
+{DataRoot}/{NodeName}/Fabric/ClusterManifest.current.xml
+# Example: /mnt/sfroot/_sys_0/Fabric/ClusterManifest.current.xml
+```
+
+> [!NOTE]
+> `TempClusterManifest.xml` is a staging file used by the extension during node bootstrap. The authoritative runtime manifest is `ClusterManifest.current.xml` under the data root. When troubleshooting, check both - they should contain the same thumbprints.
+
+### Bootstrap Agent Logs
+
+```text
+/var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/sfbootstrapagent_{PID}.log
+```
+
+## [Certificate Rotation Steps for Linux Clusters]
+
+The ARM/SFRP-level operations are the same for Linux and Windows. See [How to add and swap the Secondary Certificate using Azure Portal](./Use%20Azure%20Resource%20Explorer%20to%20add%20the%20Secondary%20Certificate.md) for the full Windows walkthrough. The steps below include the complete ARM JSON and PowerShell commands with Linux-specific notes.
+
+### Step 1 - Create a New Certificate
+
+Create or obtain a new certificate and upload it to Key Vault. Options:
+
+- Create with any reputable CA
+- Generate self-signed certs using Azure Portal -> Key Vault
+- Create and upload using PowerShell - [CreateKeyVaultAndCertificateForServiceFabric.ps1](../Scripts/CreateKeyVaultAndCertificateForServiceFabric.ps1)
+
+### Step 2 - Deploy New Cert to VMSS (osProfile/secrets)
+
+Add the new Key Vault secret URL to the VMSS `osProfile/secrets/vaultCertificates` array. Use [Resource Explorer](https://portal.azure.com/#view/Microsoft_Azure_Resources/ResourceManagerBlade/~/resourceexplorer) to navigate to the VMSS resource, then use [API Playground](https://portal.azure.com/#view/Microsoft_Azure_Resources/ResourceManagerBlade/~/armapiplayground) to PUT the updated configuration. For detailed instructions, see [Managing Azure Resources](../Deployment/managing-azure-resources.md).
+
+If the new certificate is in the **same Key Vault**, add a new entry to the existing `vaultCertificates` array:
+
+```json
+"virtualMachineProfile": {
+  "osProfile": {
+    "secrets": [
+      {
+        "sourceVault": {
+          "id": "/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.KeyVault/vaults/{vault-name}"
+        },
+        "vaultCertificates": [
+          {
+            "certificateUrl": "https://{vault-name}.vault.azure.net/secrets/{old-cert-name}/{version}",
+            "certificateStore": "My"
+          },
+          {
+            "certificateUrl": "https://{vault-name}.vault.azure.net/secrets/{new-cert-name}/{version}",
+            "certificateStore": "My"
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+If the certificate is in a **different Key Vault**, add a separate entry to the `secrets` array:
+
+```json
+"secrets": [
+  {
+    "sourceVault": {
+      "id": "/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.KeyVault/vaults/{vault-name-1}"
+    },
+    "vaultCertificates": [
+      {
+        "certificateUrl": "https://{vault-name-1}.vault.azure.net/secrets/{old-cert-name}/{version}",
+        "certificateStore": "My"
+      }
+    ]
+  },
+  {
+    "sourceVault": {
+      "id": "/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.KeyVault/vaults/{vault-name-2}"
+    },
+    "vaultCertificates": [
+      {
+        "certificateUrl": "https://{vault-name-2}.vault.azure.net/secrets/{new-cert-name}/{version}",
+        "certificateStore": "My"
+      }
+    ]
+  }
+]
+```
+
+**Alternatively, use PowerShell:**
+
+```powershell
+# Add secondary certificate to the cluster (works for both Linux and Windows clusters)
+Add-AzServiceFabricClusterCertificate `
+    -ResourceGroupName "{resource-group}" `
+    -Name "{cluster-name}" `
+    -SecretIdentifier "https://{vault-name}.vault.azure.net/secrets/{cert-name}/{version}"
+```
+
+**Or deploy via ARM template** (see [MS Learn - Add a secondary certificate using Azure Resource Manager](https://learn.microsoft.com/azure/service-fabric/service-fabric-cluster-security-update-certs-azure#add-a-secondary-certificate-using-azure-resource-manager)):
+
+```powershell
+# Deploy ARM template with new certificate parameters
+New-AzResourceGroupDeployment `
+    -ResourceGroupName "{resource-group}" `
+    -TemplateFile "{path-to-template.json}" `
+    -TemplateParameterFile "{path-to-parameters.json}"
+```
+
+Wait for the VMSS `provisioningState` to show `Succeeded` before proceeding.
+
+> [!NOTE]
+> On Linux, `"certificateStore": "My"` maps to `/var/lib/sfcerts/`. However, waagent delivers the cert files as `.crt`/`.prv` to `/var/lib/waagent/` regardless of this setting. The SF bootstrap agent searches `/var/lib/waagent/` for cert files.
+
+### Step 3 - Verify Certificate Delivery on Nodes
+
+**This is where Linux differs from Windows.** Instead of RDP + certlm.msc:
+
+1. **SSH** into a node (or use Serial Console / Run Command in Azure Portal)
+
+2. **Check waagent.log for cert download confirmation:**
+
+   ```bash
+   grep -i "download" /var/log/waagent.log | grep -i "cert"
+   ```
+
+   Successful download entries look like:
+
+   ```text
+   INFO ExtHandler Downloaded uris: [ExtHandlerSecretUri(URL=https://myvault.vault.azure.net/secrets/mycert/abc123, StoreLocation=MY, StoreName=MY)]
+   ```
+
+3. **List cert files in waagent directory:**
+
+   ```bash
+   ls -la /var/lib/waagent/*.crt /var/lib/waagent/*.prv 2>/dev/null
+   ```
+
+   You should see `.crt` and `.prv` files for each expected thumbprint.
+
+4. **Verify cert content matches expected thumbprint:**
+
+   ```bash
+   openssl x509 -in /var/lib/waagent/{THUMBPRINT}.crt -noout -fingerprint -sha1
+   ```
+
+5. **Check /var/lib/sfcerts/ directory:**
+
+   ```bash
+   ls -la /var/lib/sfcerts/
+   ```
+
+6. **Verify file permissions (sfuser must have read access):**
+
+   ```bash
+   ls -la /var/lib/waagent/*.crt /var/lib/waagent/*.prv
+   ```
+
+### Step 4 - Add Secondary Cert to VMSS Extension Settings
+
+Add `certificateSecondary` with the new thumbprint to the VMSS extension settings. Navigate to the VMSS resource in [Resource Explorer](https://portal.azure.com/#view/Microsoft_Azure_Resources/ResourceManagerBlade/~/resourceexplorer), copy the resource URI, and use [API Playground](https://portal.azure.com/#view/Microsoft_Azure_Resources/ResourceManagerBlade/~/armapiplayground) to PUT the updated configuration.
+
+> [!IMPORTANT]
+> On Linux the extension publisher is `Microsoft.Azure.ServiceFabric` and type is **`ServiceFabricLinuxNode`** (not `ServiceFabricNode` as on Windows).
+
+Modify `virtualMachineProfile / extensionProfile / extensions / settings` to add `certificateSecondary`:
+
+```json
+"virtualMachineProfile": {
+  "extensionProfile": {
+    "extensions": [
+      {
+        "properties": {
+          "autoUpgradeMinorVersion": true,
+          "settings": {
+            "clusterEndpoint": "https://{region}.servicefabric.azure.com/runtime/clusters/{clusterid}",
+            "nodeTypeRef": "NodeType0",
+            "certificate": {
+              "thumbprint": "OLD_THUMBPRINT",
+              "x509StoreName": "My"
+            },
+            "certificateSecondary": {
+              "thumbprint": "NEW_THUMBPRINT",
+              "x509StoreName": "My"
+            }
+          },
+          "publisher": "Microsoft.Azure.ServiceFabric",
+          "type": "ServiceFabricLinuxNode",
+          "typeHandlerVersion": "1.1"
+        },
+        "name": "{nodetype}_ServiceFabricLinuxNode"
+      }
+    ]
+  }
+}
+```
+
+**Repeat for each node type** (each VMSS). Execute PUT in API Playground and wait for `provisioningState` to show `Succeeded`.
+
+### Step 5 - Update SF Cluster Resource
+
+Add `thumbprintSecondary` to the `Microsoft.ServiceFabric/clusters` resource. Navigate to the SF cluster resource in [Resource Explorer](https://portal.azure.com/#view/Microsoft_Azure_Resources/ResourceManagerBlade/~/resourceexplorer), copy the resource URI, and use [API Playground](https://portal.azure.com/#view/Microsoft_Azure_Resources/ResourceManagerBlade/~/armapiplayground) to PUT the updated configuration:
+
+```json
+{
+  "type": "Microsoft.ServiceFabric/clusters",
+  "properties": {
+    "certificate": {
+      "thumbprint": "OLD_THUMBPRINT",
+      "thumbprintSecondary": "NEW_THUMBPRINT",
+      "x509StoreName": "My"
+    }
+  }
+}
+```
+
+This triggers SFRP to generate an updated ClusterManifest and initiate a cluster upgrade. Wait for `provisioningState` to reach `Succeeded`. This step can take up to an hour.
+
+### Step 6 - Swap and Remove Old Certificate
+
+Once Step 5 completes, swap the primary and secondary thumbprints so the new cert becomes primary:
+
+1. **Swap in each VMSS** (extension settings):
+
+   ```json
+   "certificate": {
+     "thumbprint": "NEW_THUMBPRINT",
+     "x509StoreName": "My"
+   },
+   "certificateSecondary": {
+     "thumbprint": "OLD_THUMBPRINT",
+     "x509StoreName": "My"
+   }
+   ```
+
+   Execute PUT in API Playground for each VMSS. Wait for `provisioningState` `Succeeded`.
+
+2. **Swap in the SF cluster resource**:
+
+   ```json
+   "certificate": {
+     "thumbprint": "NEW_THUMBPRINT",
+     "thumbprintSecondary": "OLD_THUMBPRINT",
+     "x509StoreName": "My"
+   }
+   ```
+
+   Execute PUT in API Playground. Wait for `provisioningState` `Succeeded`.
+
+3. **Remove old certificate** (after swap is complete and cluster is healthy):
+
+   Remove the old cert from `vaultCertificates` in the VMSS `osProfile/secrets`, remove `certificateSecondary` from extension settings, and remove `thumbprintSecondary` from the SF cluster resource.
+
+   Or use PowerShell:
+
+   ```powershell
+   Remove-AzServiceFabricClusterCertificate `
+       -ResourceGroupName "{resource-group}" `
+       -Name "{cluster-name}" `
+       -Thumbprint "OLD_THUMBPRINT"
+   ```
+
+## [Verify Cluster Manifest Alignment]
+
+This is a critical diagnostic step for both Linux and Windows but is especially important on Linux because waagent removes old certs from disk.
+
+The correct certificate rotation flow is:
+
+1. **Customer** updates the `Microsoft.ServiceFabric/clusters` ARM resource with the new certificate thumbprint(s)
+2. **SFRP** processes the ARM update and generates an updated ClusterManifest with the new thumbprints
+3. **SFRP** triggers a cluster upgrade to roll out the new manifest to all nodes
+4. **VMSS extension settings** are updated with the new thumbprints
+
+> [!IMPORTANT]
+> SFRP does **not** independently update the ClusterManifest. The customer must update the SF cluster ARM resource first (via Resource Explorer, API Playground, ARM template, or PowerShell). SFRP generates the ClusterManifest based on the ARM resource definition. If only the VMSS extension settings are updated (e.g., by directly modifying the VMSS resource) without updating the SF cluster resource, the ClusterManifest will be out of sync.
+
+After a certificate update, verify that the **ClusterManifest Security section matches the extension settings**:
+
+1. **Check the ClusterManifest for cert thumbprints:**
+
+   ```bash
+   grep -A2 "X509Credentials\|ClusterCertThumbprints\|ServerCertThumbprints\|ClientCertThumbprints\|Thumbprint" \
+     /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml
+   ```
+
+2. **Check the extension settings for cert thumbprints:**
+
+   ```bash
+   # Find the latest settings file
+   LATEST_SETTINGS=$(ls -t /var/lib/waagent/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode-*/config/*.settings 2>/dev/null | head -1)
+   cat "$LATEST_SETTINGS" | python3 -m json.tool | grep -i "thumbprint"
+   ```
+
+3. **Compare:** The thumbprints in the ClusterManifest Security section **must match** the thumbprints in the extension settings file. If they don't, the bootstrap agent will fail to start the node.
+
+## [Troubleshooting Certificate Issues on Linux]
+
+### Symptom: Nodes Unreachable / Bootstrap Agent Looping
+
+The SF bootstrap agent (`sfbootstrapagent`) runs in a loop attempting to configure the node. If it cannot find the certificates referenced in the ClusterManifest, it retries indefinitely.
+
+**Diagnosis steps:**
+
+1. **Check bootstrap agent logs:**
+
+   ```bash
+   # Find the latest bootstrap agent log
+   ls -lt /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/sfbootstrapagent_*.log | head -5
+
+   # Check for cert search errors
+   grep -i "FindKVVMExtCerts\|certificate\|error\|failed" \
+     /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/sfbootstrapagent_*.log | tail -20
+   ```
+
+   Common error pattern:
+
+   ```text
+   FindKVVMExtCerts: Looking for Thumbprint {OLD_THUMBPRINT} in /var/lib/waagent
+   FindKVVMExtCerts: ERR: certificate not found
+   ```
+
+2. **Check waagent.log for cert download history:**
+
+   ```bash
+   grep -E "Download|cert|incarnation|GoalState" /var/log/waagent.log | tail -30
+   ```
+
+   Key things to look for:
+   - **Incarnation changes**: Each VMSS settings update triggers a new incarnation
+   - **Certificate downloads**: Confirm new certs were downloaded after the latest incarnation
+   - **Certificate removals**: Old certs are removed when they leave the goal state
+
+3. **Check syslog for SF events:**
+
+   ```bash
+   grep -i "ServiceFabric\|fabric\|certificate" /var/log/syslog | tail -20
+   ```
+
+4. **Check heartbeat.log:**
+
+   ```bash
+   EXTENSION_DIR=$(ls -d /var/lib/waagent/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode-* 2>/dev/null | head -1)
+   cat "$EXTENSION_DIR/heartbeat.log"
+   ```
+
+   A heartbeat file filled with null bytes indicates the extension has not successfully started.
+
+### Symptom: VMSS Extension Settings Updated but ClusterManifest Out of Sync
+
+This is a known failure pattern where the VMSS extension settings have been updated with new certificate thumbprints, but the ClusterManifest still references the old thumbprints. This typically happens when:
+
+- The VMSS resource was updated directly (e.g., via SFRP backend fix or manual VMSS update) without the SF cluster ARM resource being updated first
+- The customer updated only the VMSS but forgot to update the `Microsoft.ServiceFabric/clusters` resource
+- The SF cluster ARM update failed or partially completed
+
+The result:
+
+- **New certs are on disk** (waagent downloaded them)
+- **Old certs are gone** (waagent removed them on incarnation change)
+- **Manifest still expects old certs** (ClusterManifest was not regenerated because the SF cluster ARM resource was not updated)
+- **Bootstrap agent loops** (looking for old thumbprints that no longer exist as files)
+
+> [!NOTE]
+> The correct flow is: **Customer updates SF cluster ARM resource** → **SFRP generates new ClusterManifest** → **Cluster upgrade rolls out new manifest**. SFRP never independently updates the ClusterManifest. If the VMSS settings and manifest are out of sync, it means the SF cluster ARM resource update was missed or failed.
+
+**Diagnosis:**
+
+```bash
+# Check which certs are actually on disk
+ls /var/lib/waagent/*.crt 2>/dev/null
+
+# Check which certs the manifest expects
+grep "Thumbprint" /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml
+
+# Check which certs the settings specify
+LATEST_SETTINGS=$(ls -t /var/lib/waagent/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode-*/config/*.settings 2>/dev/null | head -1)
+cat "$LATEST_SETTINGS" | python3 -m json.tool | grep -i "thumbprint"
+```
+
+**Expected alignment:**
+
+```text
+                  Manifest   Settings   On Disk (.crt/.prv)
+Old Thumbprint    ✓          ✗          ✗ (removed by waagent)      ← BROKEN
+New Thumbprint    ✗          ✓          ✓ (downloaded by waagent)   ← BROKEN
+```
+
+**Resolution:**
+
+1. **Customer action**: Update the `Microsoft.ServiceFabric/clusters` ARM resource with the correct certificate thumbprint(s) using Resource Explorer / API Playground / ARM template. This will trigger SFRP to generate a new ClusterManifest and roll it out.
+2. **If the SF cluster ARM update fails or the cluster is unreachable**: Contact Azure Support. Support can update the SFRP backend record to match the current certificate state, which will allow SFRP to generate the correct ClusterManifest.
+
+### Symptom: Certificate Present but Bootstrap Agent Can't Find It
+
+The SF bootstrap agent searches for certificate files in a specific order:
+
+1. First looks for `/var/lib/waagent/{THUMBPRINT}.crt` + `.prv` files
+2. If not found, falls back to scanning all `.pem` files in `/var/lib/waagent/`
+3. The `.pem` fallback typically only finds `microsoft_root_certificate.pem` (root CA with no private key)
+
+**Common causes:**
+- Thumbprint case mismatch (manifest has lowercase, files are uppercase)
+- File permissions prevent sfuser from reading the cert files
+- Certs were delivered to a different directory (KV extension vs waagent)
+
+**Verify:**
+
+```bash
+# Check exact filenames (thumbprints are uppercase in filenames)
+ls -la /var/lib/waagent/*.crt /var/lib/waagent/*.prv 2>/dev/null
+
+# Check if PEM files exist
+ls -la /var/lib/waagent/*.pem 2>/dev/null
+
+# Check permissions
+stat /var/lib/waagent/{THUMBPRINT}.crt
+stat /var/lib/waagent/{THUMBPRINT}.prv
+```
+
+### Symptom: Old Certificate Missing After VMSS Update
+
+On Linux, when the VMSS goal state changes (new incarnation), waagent **removes** certificate files that are no longer in the current goal state. This is different from Windows, where old certs remain in the cert store.
+
+**Impact:** If the ClusterManifest still references the old certificate (because the SF cluster ARM resource was not updated), and the old certificate has been removed from disk, the cluster node cannot start.
+ +**Verify by checking waagent.log timeline:** + +```bash +# See cert download/remove events with timestamps +grep -E "Download.*cert|Remove.*cert|incarnation" /var/log/waagent.log +``` + +## [Linux-Specific Recovery Steps for Expired/Missing Certificates] + +If the cluster is down because of cert issues and cannot be recovered through normal ARM operations: + +> [!IMPORTANT] +> The preferred resolution is always to update the `Microsoft.ServiceFabric/clusters` ARM resource with the correct thumbprint(s), which triggers SFRP to generate the correct ClusterManifest. If the ARM update cannot be applied (e.g., cluster is unreachable, SFRP rejects the update), contact Azure Support to have the SFRP backend record corrected. The options below are emergency workarounds only. + +### Option 1: Manual Certificate Placement (Emergency) + +> [!WARNING] +> This is an emergency workaround. Proper resolution is to update the SF cluster ARM resource or have Azure Support correct the SFRP backend. + +**Prerequisites:** +- SSH access to each node (via Serial Console or Run Command in Azure Portal if SSH endpoint is cert-protected) +- Azure CLI (`az`) may not be installed on SF Linux nodes. If it is not, use an external machine to download the cert and SCP it to the node, or use `curl` with the node's managed identity token to access Key Vault + +1. **SSH** to each node + +2. **Identify which thumbprint the manifest expects:** + + ```bash + grep "Thumbprint" /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml + ``` + +3. 
**If the cert exists in Key Vault, download it and place as .crt/.prv:** + + ```bash + # Option A: Using az CLI (if installed and authenticated) + az keyvault secret download --vault-name {vault-name} --name {cert-name} --file /tmp/cert.pem + + # Option B: Using curl with managed identity (if VMSS has KV access) + # Get access token + TOKEN=$(curl -s 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fvault.azure.net' -H 'Metadata: true' | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])") + # Download secret + curl -s "https://{vault-name}.vault.azure.net/secrets/{cert-name}?api-version=7.4" -H "Authorization: Bearer $TOKEN" | python3 -c "import sys,json; print(json.load(sys.stdin)['value'])" > /tmp/cert.pem + + # Extract certificate and private key + openssl x509 -in /tmp/cert.pem -out /var/lib/waagent/{THUMBPRINT}.crt + openssl pkey -in /tmp/cert.pem -out /var/lib/waagent/{THUMBPRINT}.prv + + # Set permissions + chmod 644 /var/lib/waagent/{THUMBPRINT}.crt + chmod 600 /var/lib/waagent/{THUMBPRINT}.prv + + # Clean up + rm -f /tmp/cert.pem + ``` + +4. **Restart the Azure Linux Agent to trigger the bootstrap agent:** + + ```bash + sudo systemctl restart walinuxagent + ``` + +### Option 2: Update ClusterManifest Manually (Emergency) + +> [!WARNING] +> This is an advanced emergency procedure. The ClusterManifest is normally generated by SFRP based on the SF cluster ARM resource. Manually editing it is a temporary fix that will be overwritten on the next cluster upgrade. Contact Azure Support before attempting. After the cluster is recovered, the SF cluster ARM resource must be updated with the correct thumbprints to make the fix permanent. + +1. **SSH** to each node + +2. **Stop SF processes and the bootstrap agent:** + + > [!NOTE] + > There is no single `servicefabric` systemd service. 
On Linux, the SF bootstrap agent (part of the `ServiceFabricLinuxNode` extension) launches `FabricHost`, which in turn starts all Fabric processes. The equivalent of the Windows `Stop-Service FabricHostSvc / ServiceFabricNodeBootstrapAgent` sequence is to kill the processes directly. + + ```bash + # Stop the SF bootstrap agent and FabricHost processes + sudo pkill -f sfbootstrapagent || true + sudo pkill -f FabricHost || true + # Wait for processes to exit + sleep 5 + # Verify they are stopped + ps aux | grep -E "sfbootstrapagent|FabricHost|Fabric.exe" | grep -v grep + ``` + +3. **Edit ClusterManifest to replace old thumbprints with new:** + + ```bash + sudo cp /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml \ + /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml.bak + + sudo sed -i 's/OLD_THUMBPRINT/NEW_THUMBPRINT/g' \ + /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml + ``` + +4. **Also update the node-level runtime manifest and settings:** + + ```bash + # Find the data root + # Ubuntu default: /mnt/sfroot + # RedHat default: /mnt/resource/sfroot + DATAROOT="/mnt/sfroot" + if [ ! -d "$DATAROOT" ]; then + DATAROOT="/mnt/resource/sfroot" + fi + + # Update ClusterManifest.current.xml in each node folder + find "$DATAROOT" -name "ClusterManifest.current.xml" -exec sudo cp {} {}.bak \; + find "$DATAROOT" -name "ClusterManifest.current.xml" -exec sudo sed -i 's/OLD_THUMBPRINT/NEW_THUMBPRINT/g' {} \; + + # Update InfrastructureManifest.xml + find "$DATAROOT" -name "InfrastructureManifest.xml" -exec sudo sed -i 's/OLD_THUMBPRINT/NEW_THUMBPRINT/g' {} \; + + # Update Settings.xml in the current Fabric.Config directory + find "$DATAROOT" -path "*/Fabric/Fabric.Config.*/Settings.xml" -exec sudo sed -i 's/OLD_THUMBPRINT/NEW_THUMBPRINT/g' {} \; + ``` + +5. 
**Restart SF processes:** + + ```bash + # Restart the waagent which will re-trigger the bootstrap agent + sudo systemctl restart walinuxagent + ``` + + The waagent restart triggers the SF extension, which starts the bootstrap agent, which starts FabricHost. + +6. **Repeat on all nodes**, starting with seed nodes. + +## [Quick Reference: Diagnostic Commands] + +| Task | Command | +|------|---------| +| List all certs on node | `ls -la /var/lib/waagent/*.crt /var/lib/waagent/*.prv 2>/dev/null` | +| View cert thumbprint | `openssl x509 -in /var/lib/waagent/{THUMBPRINT}.crt -noout -fingerprint -sha1` | +| View cert expiry | `openssl x509 -in /var/lib/waagent/{THUMBPRINT}.crt -noout -dates` | +| View cert subject | `openssl x509 -in /var/lib/waagent/{THUMBPRINT}.crt -noout -subject` | +| Check manifest thumbprints | `grep "Thumbprint" /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml` | +| Check extension settings | `cat $(ls -t /var/lib/waagent/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode-*/config/*.settings \| head -1) \| python3 -m json.tool \| grep thumbprint` | +| Check waagent cert downloads | `grep -i "download.*cert\|cert.*download" /var/log/waagent.log` | +| Check waagent incarnation | `grep "incarnation" /var/log/waagent.log \| tail -5` | +| Check bootstrap agent errors | `grep -i "error\|fail\|FindKVVMExtCerts" /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/sfbootstrapagent_*.log \| tail -20` | +| Check syslog for SF | `grep -i "ServiceFabric\|fabric" /var/log/syslog \| tail -20` | +| Check heartbeat | `cat /var/lib/waagent/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode-*/heartbeat.log` | +| Check extension install log | `cat /var/lib/waagent/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode-*/ServiceFabricLinuxExtension_install.log` | +| Check SF data root | `ls /mnt/sfroot/ 2>/dev/null \|\| ls /mnt/resource/sfroot/ 2>/dev/null` | + +## [Additional References] + +- [Certificates 
and security on Linux clusters (MS Learn)](https://learn.microsoft.com/azure/service-fabric/service-fabric-configure-certificates-linux) +- [Manage certificates in Service Fabric clusters (MS Learn)](https://learn.microsoft.com/azure/service-fabric/cluster-security-certificate-management) +- [Add or remove certificates for a Service Fabric cluster (MS Learn)](https://learn.microsoft.com/azure/service-fabric/service-fabric-cluster-security-update-certs-azure) +- [Managing Azure Resources](../Deployment/managing-azure-resources.md) +- [Service Fabric Ubuntu File Locations](../Cluster/Service%20Fabric%20Ubuntu%20File%20Locations.md) +- [Service Fabric Red Hat File Locations](../Cluster/Service%20Fabric%20Red%20Hat%20File%20Locations.md) +- [Fix Expired Cluster Certificate Manual Steps (Windows)](./Fix%20Expired%20Cluster%20Certificate%20Manual%20Steps.md) +- [How to add and swap the Secondary Certificate (Windows)](./Use%20Azure%20Resource%20Explorer%20to%20add%20the%20Secondary%20Certificate.md) +- [Set up encryption certificate on Linux clusters (MS Learn)](https://learn.microsoft.com/azure/service-fabric/service-fabric-application-secret-management-linux) From f3ec96dc421e3d5c280694554b3d251601c538fc Mon Sep 17 00:00:00 2001 From: jagilber Date: Tue, 31 Mar 2026 20:06:26 -0400 Subject: [PATCH 2/4] fix: correct Linux SF cert rotation TSG based on live cluster testing Tested against Ubuntu 22.04 SF cluster (sfljagilber1lx3) with full cert rotation cycle. 
Key corrections:
- Fix systemd services: servicefabric.service and servicefabricnodebootstrapagent.service DO exist on modern clusters
- Fix file permissions: POSIX ACLs (root:root + sfuser ACL), not sfuser ownership
- Fix certificateStore: null on Linux, not 'My'
- Fix typeHandlerVersion: 2.0, not 1.1
- Fix TempClusterManifest.xml: single-line XML, not updated by config upgrades
- Expand /var/lib/sfcerts/ contents (includes .pfx, .pem, transport certs)
- Add waagent .pem files note alongside .crt/.prv
- Add SFRP does NOT auto-update VMSS extension settings warning
- Add Add-AzServiceFabricClusterCertificate deprecation notice (Az 6.0+)
- Fix manifest grep commands to use python3 xml pretty-print
- Add getfacl commands for ACL verification
- Split quick reference manifest check into runtime vs staging
---
 ...ertificate Rotation and Troubleshooting.md | 92 +++++++++++++++----
 1 file changed, 73 insertions(+), 19 deletions(-)

diff --git a/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md b/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md
index 1508da81..06c6ed2f 100644
--- a/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md
+++ b/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md
@@ -17,7 +17,7 @@ Azure Service Fabric **Linux** clusters (Ubuntu, Red Hat) secured with X.509 cer
 | **Cert location** | `LocalMachine\My` cert store | `/var/lib/waagent/` (waagent-delivered) and `/var/lib/sfcerts/` (SF runtime) |
 | **Cert delivery** | CRP/VM Agent writes PFX to cert store | waagent downloads `.crt`/`.prv` files to `/var/lib/waagent/` |
 | **Old cert removal** | Old certs remain in store | waagent **removes** old certs on goal state change (incarnation update) |
-| **ACL/permissions** | `NETWORK SERVICE` ACL on private key | `sfuser` file ownership/permissions on `.crt`/`.prv` files |
+| **ACL/permissions** | `NETWORK SERVICE` ACL on private key | POSIX ACLs granting `sfuser` and `ServiceFabricAdministrators` read/write access (files are owned by `root:root`) |
 | **Verification tool** | RDP + `certlm.msc` | SSH + `ls` / `openssl` |
 | **Event logs** | `Microsoft-ServiceFabric%4Admin.evtx` | `/var/log/syslog` |
 | **Agent logs** | Windows Event Log + bootstrap agent | `/var/log/waagent.log` + bootstrap agent logs in extension directory |
@@ -42,6 +42,9 @@ Example:
 /var/lib/waagent/ED815E6241146A1730D6C81F06BD1B5692CC0942.prv
 ```
 
+> [!NOTE]
+> Waagent also creates `.pem` files alongside the `.crt`/`.prv` files in `/var/lib/waagent/`. You may see files like `{THUMBPRINT}.pem` in addition to the `.crt` and `.prv` files.
+
 > [!IMPORTANT]
 > When delivered via VMSS `osProfile/secrets`, waagent uses `.prv` extension for private keys (not `.key`). When using the Key Vault VM extension or manual placement, certificates follow the standard `.crt`/`.key` or single `.pem` format as described in [MS Learn](https://learn.microsoft.com/azure/service-fabric/service-fabric-configure-certificates-linux). The SF bootstrap agent copies/links certs from `/var/lib/waagent/` into `/var/lib/sfcerts/` for the SF runtime.
 
@@ -53,7 +56,21 @@ The SF runtime expects certificates in:
 /var/lib/sfcerts/ # Maps to LocalMachine\My on Windows
 ```
 
-All files must be in PEM format. Service Fabric expects either a `.pem` file containing both certificate and private key, or a `.crt` file with the certificate and a `.key` file with the private key (per [MS Learn](https://learn.microsoft.com/azure/service-fabric/service-fabric-configure-certificates-linux#location-and-format-of-x509-certificates-on-linux-nodes)).
+The SF runtime certificate directory contains more files than the waagent source. Typical contents include:
+
+```text
+/var/lib/sfcerts/
+  {THUMBPRINT}.crt # Certificate (PEM)
+  {THUMBPRINT}.prv # Private key (PEM)
+  {THUMBPRINT}.pem # Combined cert+key (PEM)
+  {THUMBPRINT}.pfx # PKCS#12 format
+  Certificates.pem # Aggregated certificate bundle
+  TransportCert.pem # Transport certificate
+  TransportPrivate.pem # Transport private key
+  microsoft_root_certificate.pem # Microsoft root CA
+```
+
+Service Fabric expects either a `.pem` file containing both certificate and private key, or a `.crt` file with the certificate and a `.key` file with the private key (per [MS Learn](https://learn.microsoft.com/azure/service-fabric/service-fabric-configure-certificates-linux#location-and-format-of-x509-certificates-on-linux-nodes)).
 
 ### Key Vault VM Extension Certificates
 
@@ -91,7 +108,10 @@ There are two locations with manifest files:
 ```
 
 > [!NOTE]
-> `TempClusterManifest.xml` is a staging file used by the extension during node bootstrap. The authoritative runtime manifest is `ClusterManifest.current.xml` under the data root. When troubleshooting, check both - they should contain the same thumbprints.
+> `TempClusterManifest.xml` is a staging file used by the extension during node bootstrap. It is **not** updated by cluster configuration upgrades - only `ClusterManifest.current.xml` is updated. The authoritative runtime manifest is `ClusterManifest.current.xml` under the data root. When troubleshooting certificate issues on a running cluster, always check `ClusterManifest.current.xml` first.
+
+> [!TIP]
+> `TempClusterManifest.xml` is typically stored as **single-line XML**, so `grep -A2` will not show surrounding context. Use `python3 -m xml.dom.minidom` or `xmllint --format` to pretty-print it before grepping, or use `grep -oP` with regex to extract values.
 
 ### Bootstrap Agent Logs
 
@@ -128,11 +148,11 @@ If the new certificate is in the **same Key Vault**, add a new entry to the exis
     "vaultCertificates": [
       {
         "certificateUrl": "https://{vault-name}.vault.azure.net/secrets/{old-cert-name}/{version}",
-        "certificateStore": "My"
+        "certificateStore": null
       },
       {
         "certificateUrl": "https://{vault-name}.vault.azure.net/secrets/{new-cert-name}/{version}",
-        "certificateStore": "My"
+        "certificateStore": null
       }
     ]
   }
@@ -152,7 +172,7 @@ If the certificate is in a **different Key Vault**, add a separate entry to the
     "vaultCertificates": [
       {
         "certificateUrl": "https://{vault-name-1}.vault.azure.net/secrets/{old-cert-name}/{version}",
-        "certificateStore": "My"
+        "certificateStore": null
       }
     ]
   },
@@ -163,7 +183,7 @@ If the certificate is in a **different Key Vault**, add a separate entry to the
     "vaultCertificates": [
       {
         "certificateUrl": "https://{vault-name-2}.vault.azure.net/secrets/{new-cert-name}/{version}",
-        "certificateStore": "My"
+        "certificateStore": null
       }
     ]
   }
@@ -172,7 +192,11 @@ If the certificate is in a **different Key Vault**, add a separate entry to the
 
 **Alternatively, use PowerShell:**
 
+> [!WARNING]
+> `Add-AzServiceFabricClusterCertificate` was deprecated in Az PowerShell module 6.0+ and is no longer available. Use ARM templates, Resource Explorer, or API Playground instead.
+
 ```powershell
+# DEPRECATED - only works with Az module < 6.0
 # Add secondary certificate to the cluster (works for both Linux and Windows clusters)
 Add-AzServiceFabricClusterCertificate `
   -ResourceGroupName "{resource-group}" `
@@ -193,7 +217,7 @@ New-AzResourceGroupDeployment `
 Wait for the VMSS `provisioningState` to show `Succeeded` before proceeding.
 
 > [!NOTE]
-> On Linux, `"certificateStore": "My"` maps to `/var/lib/sfcerts/`. However, waagent delivers the cert files as `.crt`/`.prv` to `/var/lib/waagent/` regardless of this setting. The SF bootstrap agent searches `/var/lib/waagent/` for cert files.
+> On Linux, `certificateStore` is `null` in the VMSS `osProfile/secrets` JSON (not `"My"` as on Windows). Waagent delivers the cert files as `.crt`/`.prv` to `/var/lib/waagent/` regardless of this setting. The SF bootstrap agent searches `/var/lib/waagent/` for cert files.
 
 ### Step 3 - Verify Certificate Delivery on Nodes
 
@@ -236,9 +260,17 @@ Wait for the VMSS `provisioningState` to show `Succeeded` before proceeding.
 6. **Verify file permissions (sfuser must have read access):**
 
    ```bash
+   # Check basic permissions and ownership
    ls -la /var/lib/waagent/*.crt /var/lib/waagent/*.prv
+
+   # Check POSIX ACLs (files are root:root owned but grant access via ACLs)
+   getfacl /var/lib/waagent/{THUMBPRINT}.crt
+   getfacl /var/lib/waagent/{THUMBPRINT}.prv
    ```
 
+   > [!NOTE]
+   > On SF Linux clusters, cert files are owned by `root:root`. Access for `sfuser` and the `ServiceFabricAdministrators` group is granted via POSIX ACLs, not file ownership. Use `getfacl` to verify the ACL entries.
+
 ### Step 4 - Add Secondary Cert to VMSS Extension Settings
 
 Add `certificateSecondary` with the new thumbprint to the VMSS extension settings. Navigate to the VMSS resource in [Resource Explorer](https://portal.azure.com/#view/Microsoft_Azure_Resources/ResourceManagerBlade/~/resourceexplorer), copy the resource URI, and use [API Playground](https://portal.azure.com/#view/Microsoft_Azure_Resources/ResourceManagerBlade/~/armapiplayground) to PUT the updated configuration.
@@ -269,7 +301,7 @@ Modify `virtualMachineProfile / extensionProfile / extensions / settings` to add
     },
     "publisher": "Microsoft.Azure.ServiceFabric",
     "type": "ServiceFabricLinuxNode",
-    "typeHandlerVersion": "1.1"
+    "typeHandlerVersion": "2.0"
   },
   "name": "{nodetype}_ServiceFabricLinuxNode"
 }
@@ -299,6 +331,9 @@ Add `thumbprintSecondary` to the `Microsoft.ServiceFabric/clusters` resource. Na
 
 This triggers SFRP to generate an updated ClusterManifest and initiate a cluster upgrade. Wait for `provisioningState` to reach `Succeeded`. This step can take up to an hour.
 
+> [!IMPORTANT]
+> SFRP does **not** automatically update the VMSS extension settings when you update the SF cluster ARM resource. Steps 4 and 5 are independent operations. SFRP only updates the ClusterManifest (which is rolled out via cluster upgrade). You must still manually update the VMSS extension settings (Step 4) to keep them in sync. The on-node extension `.settings` file is only updated when the VMSS instance is reimaged or when a new incarnation triggers the extension.
+
 ### Step 6 - Swap and Remove Old Certificate
 
 Once Step 5 completes, swap the primary and secondary thumbprints so the new cert becomes primary:
@@ -361,9 +396,18 @@ After a certificate update, verify that the **ClusterManifest Security section m
 
 1. **Check the ClusterManifest for cert thumbprints:**
 
+   Check the **runtime** manifest (authoritative) first, then the extension staging manifest:
+
    ```bash
-   grep -A2 "X509Credentials\|ClusterCertThumbprints\|ServerCertThumbprints\|ClientCertThumbprints\|Thumbprint" \
-     /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml
+   # Runtime manifest (authoritative - updated by config upgrades)
+   DATAROOT="/mnt/sfroot"
+   [ ! -d "$DATAROOT" ] && DATAROOT="/mnt/resource/sfroot"
+   MANIFEST=$(find "$DATAROOT" -name "ClusterManifest.current.xml" -print -quit 2>/dev/null)
+   python3 -c "import xml.dom.minidom,sys; print(xml.dom.minidom.parse('$MANIFEST').toprettyxml())" | grep -i "thumbprint"
+
+   # Extension staging manifest (NOT updated by config upgrades - only reflects bootstrap state)
+   # Note: TempClusterManifest.xml is single-line XML, so grep -A2 won't show context
+   python3 -c "import xml.dom.minidom; print(xml.dom.minidom.parse('/var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml').toprettyxml())" | grep -i "thumbprint"
    ```
 
 2. **Check the extension settings for cert thumbprints:**
@@ -577,12 +621,17 @@ If the cluster is down because of cert issues and cannot be recovered through no
 2. **Stop SF processes and the bootstrap agent:**
 
    > [!NOTE]
-   > There is no single `servicefabric` systemd service. On Linux, the SF bootstrap agent (part of the `ServiceFabricLinuxNode` extension) launches `FabricHost`, which in turn starts all Fabric processes. The equivalent of the Windows `Stop-Service FabricHostSvc / ServiceFabricNodeBootstrapAgent` sequence is to kill the processes directly.
+   > On modern SF Linux clusters, there are two systemd services: `servicefabric.service` (starts FabricHost via `/opt/microsoft/servicefabric/bin/starthost.sh`) and `servicefabricnodebootstrapagent.service` (the bootstrap agent). Use `systemctl` to stop/start them. On older clusters where these systemd services do not exist, fall back to killing processes directly.
 
    ```bash
-   # Stop the SF bootstrap agent and FabricHost processes
-   sudo pkill -f sfbootstrapagent || true
-   sudo pkill -f FabricHost || true
+   # Preferred: use systemctl (modern SF Linux clusters)
+   sudo systemctl stop servicefabric
+   sudo systemctl stop servicefabricnodebootstrapagent
+
+   # Fallback: kill processes directly (older clusters without systemd units)
+   # sudo pkill -f sfbootstrapagent || true
+   # sudo pkill -f FabricHost || true
+
    # Wait for processes to exit
    sleep 5
    # Verify they are stopped
@@ -624,11 +673,15 @@ If the cluster is down because of cert issues and cannot be recovered through no
 5. **Restart SF processes:**
 
    ```bash
-   # Restart the waagent which will re-trigger the bootstrap agent
-   sudo systemctl restart walinuxagent
+   # Preferred: use systemctl to restart SF services (modern clusters)
+   sudo systemctl restart servicefabricnodebootstrapagent
+   sudo systemctl restart servicefabric
+
+   # Alternative: restart waagent which will re-trigger the bootstrap agent
+   # sudo systemctl restart walinuxagent
    ```
 
-   The waagent restart triggers the SF extension, which starts the bootstrap agent, which starts FabricHost.
+   On modern clusters, restarting the systemd services directly is faster and more predictable. Restarting waagent triggers the SF extension, which starts the bootstrap agent, which starts FabricHost.
 
 6. **Repeat on all nodes**, starting with seed nodes.
 
@@ -640,7 +693,8 @@ If the cluster is down because of cert issues and cannot be recovered through no
 | View cert thumbprint | `openssl x509 -in /var/lib/waagent/{THUMBPRINT}.crt -noout -fingerprint -sha1` |
 | View cert expiry | `openssl x509 -in /var/lib/waagent/{THUMBPRINT}.crt -noout -dates` |
 | View cert subject | `openssl x509 -in /var/lib/waagent/{THUMBPRINT}.crt -noout -subject` |
-| Check manifest thumbprints | `grep "Thumbprint" /var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml` |
+| Check manifest thumbprints (runtime) | `python3 -c "import xml.dom.minidom; print(xml.dom.minidom.parse('$(find /mnt/sfroot -name ClusterManifest.current.xml -print -quit 2>/dev/null)').toprettyxml())" \| grep -i thumbprint` |
+| Check manifest thumbprints (staging) | `python3 -c "import xml.dom.minidom; print(xml.dom.minidom.parse('/var/log/azure/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode/TempClusterManifest.xml').toprettyxml())" \| grep -i thumbprint` |
 | Check extension settings | `cat $(ls -t /var/lib/waagent/Microsoft.Azure.ServiceFabric.ServiceFabricLinuxNode-*/config/*.settings \| head -1) \| python3 -m json.tool \| grep thumbprint` |
 | Check waagent cert downloads | `grep -i "download.*cert\|cert.*download" /var/log/waagent.log` |
 | Check waagent incarnation | `grep "incarnation" /var/log/waagent.log \| tail -5` |
From 7c2fa378cababf40025e6ec30bdcc4c6edcbffbf Mon Sep 17 00:00:00 2001
From: jagilber
Date: Tue, 31 Mar 2026 20:34:42 -0400
Subject: [PATCH 3/4] add SFRP support contact step and cross-references to Windows TSGs

- Add 'After Emergency Recovery: Update SFRP' section after Options 1/2
- Aligns with Manual Steps and Automated Script TSGs that require contacting Microsoft Support for SFRP backend update
- Note no Linux equivalent of FixExpiredCert.ps1
- Add missing reference to Fix Expired Cert Automated Script TSG
---
 ...ster Certificate Rotation and Troubleshooting.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md b/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md
index 06c6ed2f..d0273714 100644
--- a/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md
+++ b/Security/Service Fabric Linux Cluster Certificate Rotation and Troubleshooting.md
@@ -685,6 +685,18 @@ If the cluster is down because of cert issues and cannot be recovered through no
 
 6. **Repeat on all nodes**, starting with seed nodes.
 
+### After Emergency Recovery: Update SFRP
+
+> [!IMPORTANT]
+> The manual recovery steps above (Options 1 and 2) are **temporary fixes**. The on-node changes will be overwritten on the next cluster upgrade or VMSS reimage. To make the fix permanent:
+>
+> 1. **Contact Microsoft Support** and request that the SFRP backend record be updated to match the current certificate state. This is required when the cluster was unreachable and the SF cluster ARM resource could not be updated through normal channels.
+> 2. Once support confirms the SFRP record is corrected, **update the SF cluster ARM resource** with the correct thumbprint(s) via Resource Explorer / API Playground (see Step 5 above).
+> 3. **Update the VMSS extension settings** to match (see Step 4 above).
+> 4. **Update the VMSS osProfile/secrets** if needed (see Step 2 above).
+>
+> This aligns with the Windows recovery TSGs: [Fix Expired Cluster Certificate Manual Steps](./Fix%20Expired%20Cluster%20Certificate%20Manual%20Steps.md) and [Fix Expired Cluster Certificate Automated Script](./Fix%20Expired%20Cluster%20Certificate%20Automated%20Script.md). Note that there is no Linux equivalent of `FixExpiredCert.ps1` -- Linux emergency recovery requires the manual steps above.
+
 ## [Quick Reference: Diagnostic Commands]
 
 | Task | Command |
@@ -713,5 +725,6 @@ If the cluster is down because of cert issues and cannot be recovered through no
 - [Service Fabric Ubuntu File Locations](../Cluster/Service%20Fabric%20Ubuntu%20File%20Locations.md)
 - [Service Fabric Red Hat File Locations](../Cluster/Service%20Fabric%20Red%20Hat%20File%20Locations.md)
 - [Fix Expired Cluster Certificate Manual Steps (Windows)](./Fix%20Expired%20Cluster%20Certificate%20Manual%20Steps.md)
+- [Fix Expired Cluster Certificate Automated Script (Windows)](./Fix%20Expired%20Cluster%20Certificate%20Automated%20Script.md)
 - [How to add and swap the Secondary Certificate (Windows)](./Use%20Azure%20Resource%20Explorer%20to%20add%20the%20Secondary%20Certificate.md)
 - [Set up encryption certificate on Linux clusters (MS Learn)](https://learn.microsoft.com/azure/service-fabric/service-fabric-application-secret-management-linux)
From d9dfdd7f2ba5d7d01bc9f9d6b6148f21c80853fc Mon Sep 17 00:00:00 2001
From: jagilber
Date: Wed, 1 Apr 2026 15:37:44 -0400
Subject: [PATCH 4/4] add: include Linux Certificate Management section in Security README

---
 Security/README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Security/README.md b/Security/README.md
index f673ecfb..2c1a2c44 100644
--- a/Security/README.md
+++ b/Security/README.md
@@ -8,6 +8,9 @@ This contains the Security related TSG's surfaced in the Azure Portal during sup
 * [FixExpiredCert.ps1](../Scripts/FixExpiredCert.ps1)
 * [Create a New Self Signed Certificate](./Create%20a%20New%20Self%20Signed%20Certificate.md)
 
+### **Linux Certificate Management**
+* [Service Fabric Linux Cluster Certificate Rotation and Troubleshooting](./Service%20Fabric%20Linux%20Cluster%20Certificate%20Rotation%20and%20Troubleshooting.md)
+
 ### **Couldn't add or renew certificate**
 * [Add-AzureRmServiceFabricClusterCertificate throws error TwoCertificatesToTwoCertificatesNotAllowed](./Add-AzureRmServiceFabricClusterCertificate%20throws%20error%20TwoCertificatesToTwoCertificatesNotAllowed.md)
 * [Add_New_Cert_To_VMSS.ps1](../Scripts/Add_New_Cert_To_VMSS.ps1)