Draft
26 commits
e4bb427
#draft# readd apim tsg with new steps
jagilber Feb 8, 2025
82f7550
#draft# add script test in progress
jagilber Feb 9, 2025
b2f85dc
sync test script
jagilber Feb 12, 2025
a3099b6
remove test script
jagilber Feb 19, 2025
40721fd
##draft## add domainLableScope important note. add key vault access p…
jagilber Feb 24, 2025
14538b8
modify sfmc template for kvvm, uami
jagilber Feb 26, 2025
e4bb0c1
add sfrpProviderGuid sfmc template parameter for uami role assignment…
jagilber Mar 5, 2025
fec9600
add kvvm information
jagilber Mar 12, 2025
4bb154f
move sfmc templates to azure-samples repo
jagilber Mar 30, 2025
e13fd63
update title
jagilber Mar 31, 2025
d984180
readd base sfmc template
jagilber Apr 6, 2025
b3f5584
ai sonnet 4 review
jagilber Jun 23, 2025
355048e
Merge branch 'master' of https://github.com/Azure/Service-Fabric-Trou…
jagilber Jun 24, 2025
ab62d76
add fqdn name syntax <cluster-name>.<random generated string>.<region…
jagilber Jul 2, 2025
e16ed0d
Merge branch 'master' of https://github.com/Azure/Service-Fabric-Trou…
jagilber Jan 11, 2026
91f473d
docs: update APIM-SFMC guide - autoGeneratedDomainNameLabelScope migr…
jagilber Jan 12, 2026
106097e
docs: update APIM configuration guide for Service Fabric managed clus…
jagilber Jan 14, 2026
a162047
docs: update configuration guide for APIM with new static certificate…
jagilber Jan 14, 2026
fb87304
rename: step 1 - temp rename for case change
jagilber Jan 14, 2026
3d3be53
rename: step 2 - lowercase filename
jagilber Jan 14, 2026
2e85583
security: replace real certificate thumbprint with example value
jagilber Jan 14, 2026
fc925c1
docs: enhance APIM configuration guide with client certificate detail…
jagilber Jan 14, 2026
de18ddb
docs: add guide for configuring Service Fabric managed cluster with s…
jagilber Jan 21, 2026
f486412
docs: update sfmc-connect.ps1 with domainNameLabelScope parameter and…
jagilber Jan 25, 2026
682e2e3
docs: update APIM configuration guide for Service Fabric managed clus…
jagilber Jan 25, 2026
a0e7aef
docs: add DNS troubleshooting section and update cert validation sett…
jagilber Mar 15, 2026
1,279 changes: 1,279 additions & 0 deletions Deployment/how-to-configure-apim-for-service-fabric-managed-cluster.md

Large diffs are not rendered by default.

@@ -0,0 +1,295 @@
# How to Configure Service Fabric Managed Cluster with Static FQDN (autoGeneratedDomainNameLabelScope)

This document describes how to configure Service Fabric managed clusters with a static FQDN using the `autoGeneratedDomainNameLabelScope` property. This configuration provides a persistent FQDN that survives cluster certificate rotation, enabling stable connectivity for API Management (APIM), CI/CD pipelines, and other external integrations.

## Overview

Service Fabric managed clusters (SFMC) automatically rotate the cluster certificate every 90 days. Each rotation issues a certificate with a new thumbprint, which breaks external connections that rely on certificate thumbprint validation. The `autoGeneratedDomainNameLabelScope` property solves this by providing a static FQDN that persists across certificate rotations, so clients can validate the cluster certificate by common name instead of thumbprint.

- **Traditional FQDN format**: `<cluster-name>.<region>.cloudapp.azure.com`
- **New static FQDN format**: `<cluster-name>.<generated-hash>.<region>.sfmc.io`

**Example**: `sfmctest1nt31.duhcd6fxcrbbffb0.centralus.sfmc.io`
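
The two formats can be told apart by their suffix and label layout. The following is a minimal sketch, assuming the five-label layouts shown above; `Get-SfmcFqdnInfo` is a hypothetical helper name used for illustration and is not part of Az.ServiceFabric.

```powershell
# Hypothetical helper (not part of Az.ServiceFabric): classify a cluster FQDN
# and split out its parts, assuming the layouts described in this document.
function Get-SfmcFqdnInfo {
    param([Parameter(Mandatory)][string]$Fqdn)

    $labels = $Fqdn.ToLowerInvariant().Split('.')

    # Static format: <cluster-name>.<generated-hash>.<region>.sfmc.io
    if ($Fqdn -like '*.sfmc.io' -and $labels.Count -eq 5) {
        return [pscustomobject]@{
            Format      = 'Static'
            ClusterName = $labels[0]
            Hash        = $labels[1]
            Region      = $labels[2]
        }
    }

    # Traditional format: <cluster-name>.<region>.cloudapp.azure.com
    if ($Fqdn -like '*.cloudapp.azure.com' -and $labels.Count -eq 5) {
        return [pscustomobject]@{
            Format      = 'Traditional'
            ClusterName = $labels[0]
            Hash        = $null
            Region      = $labels[1]
        }
    }

    # Anything else is reported as unknown rather than guessed at.
    return [pscustomobject]@{
        Format      = 'Unknown'
        ClusterName = $null
        Hash        = $null
        Region      = $null
    }
}

Get-SfmcFqdnInfo -Fqdn 'sfmctest1nt31.duhcd6fxcrbbffb0.centralus.sfmc.io'
```

Returning `Unknown` for unrecognized names keeps the helper from misreporting custom domains as one of the two generated formats.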

## Key Benefits

- ✅ **Static FQDN** that persists across resource group redeployments
- ✅ **Automatic certificate rotation** (every 90 days) without FQDN change
- ✅ **No thumbprint dependency** - validate by common name instead
- ✅ **Compatible with APIM, CI/CD pipelines, and monitoring tools**
- ✅ **No manual intervention** required during certificate rotation

## Requirements

- ARM template API version: **2024-06-01-preview** or later (tested with **2024-09-01-preview**)
- PowerShell: **Az.ServiceFabric** module version **3.7.0** or later

> [!NOTE]
> To check your current Az.ServiceFabric version: `Get-Module -Name Az.ServiceFabric -ListAvailable | Select-Object Version`
>
> To update: `Update-Module -Name Az.ServiceFabric`
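
The version requirement can also be checked in a script before deployment. This is a minimal sketch, assuming the 3.7.0 minimum stated above; `Test-ModuleMinimumVersion` is a hypothetical helper name, not an Az cmdlet.

```powershell
# Hypothetical helper: returns $true when any installed version meets the minimum.
function Test-ModuleMinimumVersion {
    param(
        [string[]]$InstalledVersions,
        [version]$MinimumVersion = '3.7.0'
    )

    foreach ($v in $InstalledVersions) {
        # Cast to [version] so 3.10.0 compares as newer than 3.7.0.
        if ([version]$v -ge $MinimumVersion) { return $true }
    }
    return $false
}

# Collect every installed Az.ServiceFabric version (empty if the module is absent).
$installed = @(Get-Module -Name Az.ServiceFabric -ListAvailable |
        ForEach-Object { $_.Version.ToString() })

if (Test-ModuleMinimumVersion -InstalledVersions $installed) {
    Write-Host 'Az.ServiceFabric meets the minimum version.'
} else {
    Write-Host 'Az.ServiceFabric 3.7.0 or later was not found.'
}
```

Comparing `[version]` values rather than strings avoids treating `3.10.0` as older than `3.7.0`.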

## Configuration Methods

### Option 1: New Cluster Deployment (Recommended)

Configure `autoGeneratedDomainNameLabelScope` during initial cluster creation for immediate availability.

**ARM Template**:

```json
{
  "apiVersion": "2024-09-01-preview",
  "type": "Microsoft.ServiceFabric/managedClusters",
  "name": "[parameters('clusterName')]",
  "location": "[resourceGroup().location]",
  "sku": {
    "name": "[parameters('clusterSku')]"
  },
  "properties": {
    "autoGeneratedDomainNameLabelScope": "ResourceGroupReuse",
    "dnsName": "[toLower(parameters('clusterName'))]",
    "adminUserName": "[parameters('adminUserName')]",
    "adminPassword": "[parameters('adminPassword')]",
    // ... other properties
  }
}
```

**PowerShell**:

```powershell
# $adminPassword is assumed to be a SecureString, for example from Read-Host -AsSecureString
$parameters = @{
    ResourceGroupName                 = 'TestRG'
    Name                              = 'sfmccluster'
    Location                          = 'centralus'
    Sku                               = 'Standard'
    AdminUserName                     = 'cloudadmin'
    AdminPassword                     = $adminPassword
    AutoGeneratedDomainNameLabelScope = 'ResourceGroupReuse'
    # ... other parameters
}

New-AzServiceFabricManagedCluster @parameters
```

### Option 2: Update Existing Cluster

For existing clusters without `autoGeneratedDomainNameLabelScope`, you can add this property post-deployment. **Important**: The backend provisioning process can take up to 4 hours to complete, though in practice it may complete much faster.

#### Understanding the Update Process

When updating an existing cluster, `Set-AzServiceFabricManagedCluster` performs the following:

1. **3 Upgrade Domain (UD) Walks**: The operation walks through all 3 upgrade domains to apply the configuration change
2. **Long-Running Operation**: Due to the UD walks and backend periodic job, the operation can exceed standard ARM timeout limits
3. **Automatic Revert on Failure**: If the operation fails or times out, the cluster **automatically reverts** the change to maintain stability
4. **Safe to Retry**: The revert behavior makes it safe to retry immediately after timeout or failure

**Expected Behavior**:
- ✅ ARM operation may time out before backend job completes (normal)
- ✅ Cluster remains accessible during the entire process
- ✅ Configuration reverts automatically if operation fails
- ⏳ Backend periodic job can take up to 4 hours after successful ARM operation

**Using PowerShell**:

```powershell
# Update existing cluster with autoGeneratedDomainNameLabelScope
$resourceGroupName = 'TestRG'
$clusterName = 'sfmccluster'

Set-AzServiceFabricManagedCluster -ResourceGroupName $resourceGroupName `
    -Name $clusterName `
    -AutoGeneratedDomainNameLabelScope 'ResourceGroupReuse'

# Note: The command may time out due to the 4-hour backend job window.
# This is expected behavior - the cluster will revert the change if the operation fails.
# Safe to retry immediately if a timeout occurs.
```

**Handling Timeouts**:

```powershell
try {
    Set-AzServiceFabricManagedCluster -ResourceGroupName $resourceGroupName `
        -Name $clusterName `
        -AutoGeneratedDomainNameLabelScope 'ResourceGroupReuse' `
        -ErrorAction Stop
    Write-Host "✅ Operation completed successfully" -ForegroundColor Green
}
catch {
    if ($_.Exception.Message -match "timeout|timed out") {
        Write-Host "⚠️ Operation timed out (expected). Retrying..." -ForegroundColor Yellow
        Write-Host "   Note: Cluster reverted change automatically." -ForegroundColor Cyan

        # Retry immediately - safe because cluster reverted
        Set-AzServiceFabricManagedCluster -ResourceGroupName $resourceGroupName `
            -Name $clusterName `
            -AutoGeneratedDomainNameLabelScope 'ResourceGroupReuse'
    }
    else {
        Write-Error $_.Exception.Message
    }
}
```
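
When more than one retry may be needed, the same pattern can be wrapped in a bounded loop. This is a sketch only, assuming that timeout errors can be recognized by the same message match used above; `Invoke-WithTimeoutRetry` and its parameters are hypothetical names, not part of Az.ServiceFabric.

```powershell
# Hypothetical helper: run an action and retry only when the error looks like a timeout.
function Invoke-WithTimeoutRetry {
    param(
        [Parameter(Mandatory)][scriptblock]$Action,
        [int]$MaxAttempts = 3,
        [int]$DelaySeconds = 0
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return & $Action
        }
        catch {
            $isTimeout = $_.Exception.Message -match 'timeout|timed out'

            # Rethrow non-timeout errors, and the final timeout once attempts are exhausted.
            if (-not $isTimeout -or $attempt -eq $MaxAttempts) { throw }

            Write-Warning "Attempt $attempt timed out. Retrying..."
            if ($DelaySeconds -gt 0) { Start-Sleep -Seconds $DelaySeconds }
        }
    }
}

# Example usage (requires Az.ServiceFabric and a signed-in session):
# Invoke-WithTimeoutRetry -MaxAttempts 3 -Action {
#     Set-AzServiceFabricManagedCluster -ResourceGroupName $resourceGroupName `
#         -Name $clusterName `
#         -AutoGeneratedDomainNameLabelScope 'ResourceGroupReuse' `
#         -ErrorAction Stop
# }
```

The action should use `-ErrorAction Stop` so that cmdlet errors become terminating errors that the `catch` block can see. Errors that do not look like timeouts are rethrown on the first attempt rather than retried.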

**Using ARM Template**:

```json
{
  "apiVersion": "2024-09-01-preview",
  "type": "Microsoft.ServiceFabric/managedClusters",
  "properties": {
    "autoGeneratedDomainNameLabelScope": "ResourceGroupReuse",
    // ... other existing properties
  }
}
```

> [!NOTE]
> ARM template deployments are subject to the same timeout and UD walk behavior as PowerShell commands. Consider using PowerShell with retry logic for better timeout handling.

## Monitoring FQDN Provisioning

After applying the configuration, monitor the FQDN to determine when the backend job completes:

```powershell
# Check current FQDN
$cluster = Get-AzServiceFabricManagedCluster -ResourceGroupName $resourceGroupName -Name $clusterName
Write-Host "Current FQDN: $($cluster.Fqdn)"

# Check if updated to *.sfmc.io format
if ($cluster.Fqdn -like "*.sfmc.io") {
    Write-Host "✅ FQDN successfully updated to static format" -ForegroundColor Green
} else {
    Write-Host "⏳ FQDN not yet updated. Backend job still in progress." -ForegroundColor Yellow
    Write-Host "   Current: $($cluster.Fqdn)"
    Write-Host "   Wait and check again (can take up to 4 hours)"
}
```
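
The single check above can be repeated on an interval until the FQDN changes or the 4-hour window elapses. This is a minimal sketch; `Wait-ForCondition`, its parameters, and the 5-minute interval are assumptions chosen for illustration, not documented values.

```powershell
# Hypothetical helper: poll a check until it returns $true or the checks run out.
function Wait-ForCondition {
    param(
        [Parameter(Mandatory)][scriptblock]$Condition,
        [int]$MaxChecks = 48,           # 48 checks x 5 minutes = 4 hours
        [int]$IntervalSeconds = 300
    )

    for ($i = 1; $i -le $MaxChecks; $i++) {
        if (& $Condition) { return $true }

        # Do not sleep after the final check.
        if ($i -lt $MaxChecks) { Start-Sleep -Seconds $IntervalSeconds }
    }
    return $false
}

# Example usage (requires Az.ServiceFabric and a signed-in session):
# $updated = Wait-ForCondition -Condition {
#     $c = Get-AzServiceFabricManagedCluster -ResourceGroupName $resourceGroupName -Name $clusterName
#     $c.Fqdn -like '*.sfmc.io'
# }
# if (-not $updated) { Write-Host 'FQDN still not updated after the polling window.' }
```

The function returns `$false` instead of throwing when the window elapses, so the caller can decide whether to retry `Set-AzServiceFabricManagedCluster`.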

## Verification After Backend Job Completes

```powershell
# The new FQDN format will be: <cluster-name>.<hash>.<region>.sfmc.io
# Example: sfmctest1nt31.duhcd6fxcrbbffb0.centralus.sfmc.io

# Test connectivity to new endpoint
$cluster = Get-AzServiceFabricManagedCluster -ResourceGroupName $resourceGroupName -Name $clusterName
Test-NetConnection -ComputerName $cluster.Fqdn -Port 19000

# Verify FQDN format
if ($cluster.Fqdn -match "^[^.]+\.[^.]+\.[^.]+\.sfmc\.io$") {
    Write-Host "✅ FQDN format correct: $($cluster.Fqdn)" -ForegroundColor Green
} else {
    Write-Host "⚠️ FQDN format unexpected: $($cluster.Fqdn)" -ForegroundColor Yellow
}
```

## Troubleshooting Slow Provisioning

The `Set-AzServiceFabricManagedCluster` command may time out due to the 4-hour backend job window and 3 UD walks. **Important**: If the operation fails or times out, the cluster automatically reverts the change. You can retry immediately; there is no need to wait the full 4 hours.

```powershell
# Check if FQDN has changed to *.sfmc.io format
$cluster = Get-AzServiceFabricManagedCluster -ResourceGroupName $resourceGroupName -Name $clusterName
if ($cluster.Fqdn -notlike "*.sfmc.io") {
    Write-Host "FQDN not yet updated. Retrying Set-AzServiceFabricManagedCluster..." -ForegroundColor Yellow

    # Retry the operation - safe to retry even if less than 4 hours
    # If previous operation timed out, cluster already reverted the change
    Set-AzServiceFabricManagedCluster -ResourceGroupName $resourceGroupName `
        -Name $clusterName `
        -AutoGeneratedDomainNameLabelScope 'ResourceGroupReuse'

    Write-Host "Retry initiated. Backend periodic job can take up to 4 hours." -ForegroundColor Cyan
    Write-Host "Check again periodically or if command times out." -ForegroundColor Cyan
} else {
    Write-Host "✅ FQDN successfully updated: $($cluster.Fqdn)" -ForegroundColor Green
}
```

### Common Issues

| Issue | Description | Resolution |
|-------|-------------|------------|
| **"ARM operation times out"** | Set-AzServiceFabricManagedCluster returns timeout after 3 UD walks | **Retry immediately** - cluster automatically reverted change, safe to retry |
| **"Operation appears stuck"** | Command running longer than expected | Normal - 3 UD walks take time. Wait for timeout or completion |
| **"Set-AzServiceFabricManagedCluster fails"** | Command returns error | Cluster reverted automatically. Retry the operation |
| **"FQDN still shows old format"** | Backend job hasn't completed after successful command | Retry Set-AzServiceFabricManagedCluster to re-trigger backend job |
| **"ARM shows property but FQDN unchanged"** | Azure Resource Manager updated but backend pending | Normal - ARM updates immediately, backend job follows (up to 4 hours) |
| **"Multiple retries needed"** | Several attempts before FQDN updates | Normal - backend periodic job timing varies. Keep retrying |

**Key Points**:
- ✅ **Automatic revert on failure** - cluster rolls back change if operation fails
- ✅ **Safe to retry immediately** - revert behavior ensures cluster stability
- ✅ **No need to wait 4 hours** - the 4-hour window is for backend job, not retry wait time
- ✅ **3 UD walks required** - operation touches all upgrade domains (takes time)
- ✅ **Retrying re-triggers the backend job** - helps if periodic job missed first request
- ⏳ **Backend job can take up to 4 hours** after successful ARM operation

## Validated Test Results

**Test Configuration** (January 18, 2026):
- Cluster: `sfmctest1nt31`
- Resource Group: `sfmctest1nt31`
- Location: `centralus`
- Initial FQDN: `sfmctest1nt31.centralus.cloudapp.azure.com`

**Command Executed**:
```powershell
Set-AzServiceFabricManagedCluster -ResourceGroupName "sfmctest1nt31" `
    -Name "sfmctest1nt31" `
    -AutoGeneratedDomainNameLabelScope 'ResourceGroupReuse'
```

**Results**:
- ✅ ARM operation completed: ~6 minutes
- ✅ FQDN immediately updated: `sfmctest1nt31.duhcd6fxcrbbffb0.centralus.sfmc.io`
- ✅ Connectivity verified: Port 19000 accessible
- ✅ Format validated: `*.sfmc.io` pattern correct

**Key Finding**: While Microsoft documentation states the backend job **can take up to 4 hours**, in this test the FQDN was updated immediately upon command completion. Actual provisioning time may vary by cluster configuration and backend load.

## Important Notes

- **Deployment-time configuration recommended**: For new clusters, configure during initial deployment for immediate availability
- **Property update is immediate**: Azure Resource Manager reflects the change immediately
- **Backend job timing varies**: FQDN provisioning may complete immediately or take up to 4 hours
- **Cluster remains accessible**: Old endpoint continues working during transition period
- **No downtime**: Certificate rotation happens seamlessly with static FQDN

## SFMC Architecture Note

Service Fabric managed clusters create infrastructure resources in a **separate auto-managed resource group** named `SFC_<cluster-id>`. This resource group contains:
- Virtual Machine Scale Sets (VMSS)
- Storage accounts
- Load balancers
- Network security groups
- Virtual networks (if not BYO VNET)

The primary resource group contains only the cluster resource itself and optionally a managed identity. When troubleshooting or monitoring deployments, check both resource groups.

## Use Cases

This configuration is essential for:

1. **Azure API Management (APIM)**: Backend configuration using common name validation instead of thumbprint
2. **CI/CD Pipelines**: Stable endpoint for deployment automation
3. **Monitoring Tools**: Consistent FQDN for metrics collection and alerting
4. **External Applications**: Any system connecting to the cluster that cannot handle FQDN changes
5. **Certificate Management**: Eliminates manual updates when cluster certificate rotates

## Related Documentation

- [How to configure APIM for a Service Fabric managed cluster](./how-to-configure-apim-for-service-fabric-managed-cluster.md)
- [How to Export Service Fabric Managed Cluster Configuration](./how-to-export-service-fabric-managed-cluster-configuration.md)
- [Bring your own virtual network (Microsoft Learn)](https://learn.microsoft.com/azure/service-fabric/how-to-managed-cluster-networking#bring-your-own-virtual-network)
- [Managed TLS Solution (Internal - requires Microsoft authentication)](https://eng.ms/docs/products/azure/service-fabric/how-to/publiccacluster/managed-tls-solution)

## Reference Templates

Example Service Fabric managed cluster ARM templates with domainNameLabel configuration:
- [SF-Managed-Standard-SKU-1-NT-DomainNameLabel](https://github.com/jagilber/service-fabric-cluster-templates/blob/sfmcapim/SF-Managed-Standard-SKU-1-NT-DomainNameLabel/azuredeploy.json)
- [SF-Managed-Standard-SKU-1-NT-DomainNameLabel-KVVM-MI](https://github.com/jagilber/service-fabric-cluster-templates/blob/sfmcapim/SF-Managed-Standard-SKU-1-NT-DomainNameLabel-KVVM-MI/azuredeploy.json)
- [Standard SKU Service Fabric managed cluster, 2 node types, deployed in to existing subnet](https://github.com/Azure-Samples/service-fabric-cluster-templates/tree/master/SF-Managed-Standard-SKU-2-NT-BYOVNET)
67 changes: 49 additions & 18 deletions Scripts/sfmc-connect.ps1
@@ -32,6 +32,9 @@
.EXAMPLE
.\sfmc-connect.ps1 -clusterEndpoint mycluster.eastus.cloudapp.azure.com -commonName *.mycluster.com

+ .EXAMPLE
+ .\sfmc-connect.ps1 -clusterEndpoint mycluster.eastus.cloudapp.azure.com -thumbprint ABCD... -domainNameLabelScope
+
.LINK
invoke-webRequest "https://raw.githubusercontent.com/Azure/Service-Fabric-Troubleshooting-Guides/master/Scripts/sfmc-connect.ps1" -outFile "$pwd\sfmc-connect.ps1";

@@ -58,7 +61,11 @@ param(

[Parameter(ParameterSetName = 'thumbprint')]
[Parameter(ParameterSetName = 'commonName')]
- $clusterendpointPort = 19000
+ $clusterendpointPort = 19000,
+
+ [Parameter(ParameterSetName = 'thumbprint')]
+ [Parameter(ParameterSetName = 'commonName')]
+ [switch]$domainNameLabelScope
)

function main() {
@@ -130,23 +137,47 @@ function main() {
write-host "using server thumbprint:$serverCertThumbprint" -ForegroundColor Cyan
}

- write-host "Connect-ServiceFabricCluster -ConnectionEndpoint $clusterEndpoint`:$clusterendpointPort ``
- -ServerCertThumbprint $serverCertThumbprint ``
- -StoreLocation $storeLocation ``
- -StoreName $storeName ``
- -X509Credential ``
- -FindType $findType ``
- -FindValue $findValue ``
- -Verbose" -ForegroundColor Green
-
- Connect-ServiceFabricCluster -ConnectionEndpoint "$clusterEndpoint`:$clusterendpointPort" `
- -ServerCertThumbprint $serverCertThumbprint `
- -StoreLocation $storeLocation `
- -StoreName $storeName `
- -X509Credential `
- -FindType $findType `
- -FindValue $findValue `
- -verbose
+ # Extract FQDN for ServerCommonName if using domainNameLabelScope
+ $clusterFqdn = $clusterEndpoint -replace ':\d+$', ''
+
+ if ($domainNameLabelScope) {
+ write-host "Connect-ServiceFabricCluster -ConnectionEndpoint $clusterEndpoint`:$clusterendpointPort ``
+ -ServerCommonName $clusterFqdn ``
+ -StoreLocation $storeLocation ``
+ -StoreName $storeName ``
+ -X509Credential ``
+ -FindType $findType ``
+ -FindValue $findValue ``
+ -Verbose" -ForegroundColor Green
+
+ Connect-ServiceFabricCluster -ConnectionEndpoint "$clusterEndpoint`:$clusterendpointPort" `
+ -ServerCommonName $clusterFqdn `
+ -StoreLocation $storeLocation `
+ -StoreName $storeName `
+ -X509Credential `
+ -FindType $findType `
+ -FindValue $findValue `
+ -verbose
+ }
+ else {
+ write-host "Connect-ServiceFabricCluster -ConnectionEndpoint $clusterEndpoint`:$clusterendpointPort ``
+ -ServerCertThumbprint $serverCertThumbprint ``
+ -StoreLocation $storeLocation ``
+ -StoreName $storeName ``
+ -X509Credential ``
+ -FindType $findType ``
+ -FindValue $findValue ``
+ -Verbose" -ForegroundColor Green
+
+ Connect-ServiceFabricCluster -ConnectionEndpoint "$clusterEndpoint`:$clusterendpointPort" `
+ -ServerCertThumbprint $serverCertThumbprint `
+ -StoreLocation $storeLocation `
+ -StoreName $storeName `
+ -X509Credential `
+ -FindType $findType `
+ -FindValue $findValue `
+ -verbose
+ }
}

function get-clientCert($storeLocation, $storeName, $thumbprint = $null, $commonName = $null) {