End-to-end Intelligent Operations framework built on Azure Monitor, Microsoft Sentinel, Microsoft Fabric, and ServiceNow — spanning IT infrastructure, applications, OT/IoT, and security.
A reusable, multi-tenant solution accelerator that gives each customer a turnkey observability stack:
- IT Observability: Azure Monitor + Log Analytics + Application Insights (OpenTelemetry)
- OT/IoT Observability: Azure IoT Hub + IoT Edge + Defender for IoT + Microsoft Fabric
- Security Observability: Microsoft Sentinel (SIEM/SOAR/UEBA)
- Workflow Automation: ServiceNow ITSM + Azure Automation + Logic Apps
- AI/Proactive Intelligence: Copilot for Operations, Predictive Maintenance, Data Activator
┌────────────────────────────────────────────────────────────────┐
│  Layer 5: Action & Automation                                  │
│  ServiceNow (ITSM/CMDB) ◄──► Logic Apps ◄──► Automation        │
├────────────────────────────────────────────────────────────────┤
│  Layer 4: Visualization                                        │
│  Azure Workbooks │ Power BI │ Managed Grafana │ Sentinel WB    │
├────────────────────────────────────────────────────────────────┤
│  Layer 3: Intelligence                                         │
│  Dynamic Thresholds │ Smart Detection │ Copilot │ ML (MTTF)    │
│  KQL Anomaly Detection │ Log Clustering │ Data Activator       │
├──────────────────────────────┬─────────────────────────────────┤
│  Layer 2a: IT Data Platform  │  Layer 2b: OT Data Platform     │
│  Log Analytics (Hub-Spoke)   │  Microsoft Fabric               │
│  + Sentinel                  │  KQL DB + Lakehouse             │
│  + App Insights              │                                 │
├──────────────────────────────┼─────────────────────────────────┤
│  Layer 1a: IT Data Plane     │  Layer 1b: OT/IoT Data Plane    │
│  AMA + DCR (Azure Policy)    │  IoT Edge + IoT Hub             │
│  Azure Arc (hybrid)          │  Defender for IoT               │
│  OpenTelemetry SDK           │  OPC-UA / Modbus / MQTT         │
│  Diagnostic Settings         │                                 │
└──────────────────────────────┴─────────────────────────────────┘
Each customer gets an isolated spoke workspace. Your managed service hub provides aggregated views via Azure Lighthouse.
┌───────────────────────────────────────────────────────┐
│  MANAGED SERVICE HUB (Your Tenant)                    │
│  ┌─────────────────────────────────────────────────┐  │
│  │  Hub Log Analytics Workspace                    │  │
│  │  • Cross-workspace KQL queries                  │  │
│  │  • Aggregated MTTI/MTTR dashboards              │  │
│  │  • Azure Lighthouse delegated access            │  │
│  └─────────────────────────────────────────────────┘  │
└──────────────┬──────────────┬──────────────┬──────────┘
               │              │              │
    ┌──────────▼───┐   ┌──────▼───────┐    ┌─▼────────────┐
    │ Customer A   │   │ Customer B   │    │ Customer C   │
    │ Spoke WS     │   │ Spoke WS     │    │ Spoke WS     │
    │ + Sentinel   │   │ + Sentinel   │    │ + Sentinel   │
    │ + App Ins.   │   │ + App Ins.   │    │ + App Ins.   │
    └──────────────┘   └──────────────┘    └──────────────┘
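
The hub's cross-workspace KQL queries can be composed programmatically. The sketch below builds a `union` over the same table in several spoke workspaces using the KQL `workspace()` expression. The function name and the second workspace name (`la-acme-obs`) are illustrative assumptions and are not part of the accelerator.

```python
def build_cross_workspace_query(workspaces: list[str], table: str, tail: str = "") -> str:
    """Build a KQL union of `table` across several Log Analytics workspaces."""
    if not workspaces:
        raise ValueError("at least one workspace is required")
    # workspace("<name>").<Table> addresses a table in another workspace.
    sources = ", ".join(f'workspace("{name}").{table}' for name in workspaces)
    # withsource adds a column recording which source each row came from.
    query = f"union withsource=SourceWorkspace {sources}"
    return f"{query}\n| {tail}" if tail else query


# Hypothetical spoke workspace names following the la-<customer>-obs pattern.
query = build_cross_workspace_query(
    ["la-contoso-obs", "la-acme-obs"],
    "Heartbeat",
    "summarize LastSeen = max(TimeGenerated) by SourceWorkspace, Computer",
)
print(query)
```

The resulting string can be pasted into the hub workspace's query editor. It only returns data for spokes the signed-in identity can read, which is what the Lighthouse delegation is meant to provide.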
| Folder | Contents |
|---|---|
| `accelerator.py` | One-command CLI: discover, provision, generate, deploy |
| `discovery/` | Resource scanner (`scan_subscription.py`), workbook generator (`generate_workbooks.py`), deployer (`deploy_workbooks.py`) |
| `infra/` | Terraform root config (`main.tf`, `variables.tf`, `outputs.tf`, `providers.tf`) |
| `infra/modules/` | Terraform modules: log-analytics-spoke, sentinel, app-insights, alert-rules, action-groups, amba-alerts, aks-observability, iot-hub, lighthouse, network-observability, workbooks, policy-initiative, diagnostic-settings, ampls, fabric-workspace |
| `dashboards/workbooks/` | 15 pre-built Azure Workbook JSON templates (VM, K8s, IoT, App, Network, Storage, Key Vault, SQL, Cosmos DB, Logic App, infra health, cost, security, performance, MTTI/MTTR) |
| `dashboards/powerbi/` | Power BI templates for OT observability |
| `automation/runbooks/` | Azure Automation runbooks for L0 remediation |
| `automation/logic-apps/` | Logic App templates for ServiceNow integration |
| `policies/` | Azure Policy definitions and initiatives |
| `docs/` | Architecture docs, onboarding playbook, instrumentation guides |
| `templates/` | Assessment templates (gap analysis, partner matrix) |
| `generated-workbooks/` | (auto-created) Customer-specific workbooks from discovery |
The fastest way to deploy. The accelerator auto-discovers your Azure resources, generates dashboards, and deploys everything with a single command.
| Requirement | Details |
|---|---|
| Python | >= 3.10 with pip |
| Azure CLI | Latest version, authenticated (az login) |
| Azure Subscription | With Contributor + User Access Administrator roles |
| Terraform | >= 1.5.0 — only needed for full deploy mode, not discovery-only |
git clone <repo-url>
cd observability-solution-accelerator
# Install Python dependencies
pip install -r discovery/requirements.txt
# Run the accelerator — discovers resources, creates RG & workspace, generates and deploys workbooks
python accelerator.py \
  --subscription-id "your-subscription-id" \
  --customer-name "contoso" \
  --location "westus2" \
  --discovery-only

No pre-created resources needed. The accelerator auto-creates the resource group (`rg-contoso-obs`) and Log Analytics workspace (`la-contoso-obs`) for you.
This single command will:
- Scan your subscription and inventory all resources (VMs, AKS, IoT, App Services, databases, networking, etc.)
- Detect network posture — identifies publicly accessible resources and private endpoint coverage
- Auto-create the resource group and Log Analytics workspace
- Generate type-specific Azure Workbooks based on discovered resources
- Deploy all generated workbooks to your workspace automatically
Add --resource-group "my-custom-rg" to override the default resource group name.
For the complete stack (Sentinel, alerts, policies, automation), use full deploy mode:
python accelerator.py \
  --subscription-id "your-subscription-id" \
  --customer-name "contoso" \
  --resource-group "rg-contoso-obs" \
  --location "westus2"

This runs discovery, generates a `.tfvars` file, and executes `terraform plan`. Add `--auto-approve` to deploy without review, or `--dry-run` to preview all steps.
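
To make the "generates a `.tfvars` file" step concrete, here is a minimal sketch of rendering discovered settings as Terraform variable assignments. It is not the accelerator's actual generator: `to_hcl` and `render_tfvars` are hypothetical helpers, and only scalars, lists, and flat or nested maps are handled.

```python
import json


def to_hcl(value) -> str:
    """Render a Python value as an HCL literal (scalars, lists, maps only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return json.dumps(value)  # double-quoted, escaped string
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(to_hcl(v) for v in value) + "]"
    if isinstance(value, dict):
        if not value:
            return "{}"
        # Quote keys so names containing hyphens (e.g. "web-nsg") stay valid.
        body = ", ".join(f"{json.dumps(str(k))} = {to_hcl(v)}" for k, v in value.items())
        return "{ " + body + " }"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def render_tfvars(settings: dict) -> str:
    """Return one `name = value` line per setting."""
    return "".join(f"{name} = {to_hcl(value)}\n" for name, value in settings.items())


print(render_tfvars({
    "customer_name": "contoso",
    "enable_sentinel": True,
    "app_insights_apps": ["webapp", "api"],
}))
```

Writing the result to a file and passing it with `-var-file` would mirror the manual workflow described later in this README.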
| Discovered Resources | Auto-Generated Workbook | Terraform Feature |
|---|---|---|
| VMs, VMSS | VM Logs (CPU, memory, disk, events) | enable_amba = true |
| AKS Clusters | K8s Logs (pods, containers, nodes) | enable_aks = true |
| IoT Hubs | IoT Logs (device health, telemetry) | enable_iot_hub = true |
| App Services, Functions | Application Logs (errors, latency, traces) | App Insights apps detected |
| NSGs, VNets, Load Balancers | Network Logs (flows, connectivity) | enable_network_observability = true |
| Storage Accounts | Storage Logs (blob/queue/table operations, latency) | Diagnostic settings |
| Key Vaults | Key Vault Logs (access, secrets, certificates) | Diagnostic settings |
| SQL Databases | SQL Logs (DTU, connections, deadlocks) | Diagnostic settings |
| Cosmos DB Accounts | Cosmos DB Logs (RU consumption, latency, errors) | Diagnostic settings |
| Logic Apps | Logic App Logs (run history, failures, latency) | Diagnostic settings |
| (all resources) | Infrastructure Health (cross-resource health overview) | — |
| (all resources) | Cost & Usage (resource costs, trends) | — |
| (all resources) | App Performance (request rate, response time, failures) | — |
| (Sentinel enabled) | Security Posture (incidents, alerts, compliance) | enable_sentinel = true |
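
One way to picture the mapping in the table above is a lookup from Azure resource type to Terraform feature flag. The sketch below is an assumption about how such a lookup could work, not the scanner's real logic; the flag names come from the table, and the function name is hypothetical.

```python
# Resource types are compared case-insensitively, so keys are lowercase.
FLAG_BY_RESOURCE_TYPE = {
    "microsoft.compute/virtualmachines": "enable_amba",
    "microsoft.compute/virtualmachinescalesets": "enable_amba",
    "microsoft.containerservice/managedclusters": "enable_aks",
    "microsoft.devices/iothubs": "enable_iot_hub",
    "microsoft.network/networksecuritygroups": "enable_network_observability",
    "microsoft.network/virtualnetworks": "enable_network_observability",
    "microsoft.network/loadbalancers": "enable_network_observability",
}


def detect_feature_flags(resource_types: list[str]) -> dict[str, bool]:
    """Turn a list of discovered resource types into feature-flag booleans."""
    flags = {flag: False for flag in sorted(set(FLAG_BY_RESOURCE_TYPE.values()))}
    for resource_type in resource_types:
        flag = FLAG_BY_RESOURCE_TYPE.get(resource_type.lower())
        if flag is not None:
            flags[flag] = True
    return flags


print(detect_feature_flags([
    "Microsoft.Compute/virtualMachines",
    "Microsoft.Storage/storageAccounts",  # covered by diagnostic settings, no flag
]))
```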
Scan multiple subscriptions in a single run, or auto-discover all subscriptions in your tenant:
# Scan specific subscriptions
python accelerator.py \
  --subscription-ids "sub-id-1,sub-id-2,sub-id-3" \
  --customer-name "contoso" \
  --discovery-only

# Auto-discover and scan all subscriptions in the tenant
python accelerator.py \
  --tenant-scan \
  --customer-name "contoso" \
  --discovery-only

Multi-subscription mode merges all discovered resources into a single inventory and generates one unified set of workbooks.
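
The merge into a single inventory can be pictured as follows. This is a sketch under an assumed data shape (each scan is a dict with `subscription_id` and a `resources` list whose items carry an `id`); the real inventory format may differ.

```python
def merge_inventories(scans: list[dict]) -> dict:
    """Combine per-subscription scans into one inventory, de-duplicated by resource ID."""
    merged: dict[str, dict] = {}
    for scan in scans:
        for resource in scan["resources"]:
            # Azure resource IDs are case-insensitive, so normalise the key.
            merged.setdefault(resource["id"].lower(), resource)
    return {
        "subscriptions": [scan["subscription_id"] for scan in scans],
        "resources": list(merged.values()),
    }


# Hypothetical scan results for two subscriptions.
inventory = merge_inventories([
    {"subscription_id": "sub-id-1", "resources": [{"id": "/subscriptions/sub-id-1/vm-a", "type": "vm"}]},
    {"subscription_id": "sub-id-2", "resources": [{"id": "/subscriptions/sub-id-2/aks-b", "type": "aks"}]},
])
print(len(inventory["resources"]))
```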
The discovery scan automatically checks every resource for:
- Public network access: flags resources with `publicNetworkAccess` enabled
- Private endpoint coverage: detects whether private endpoints are configured
The output includes a network_posture section per subscription with counts of public-facing vs. private resources. Use this to prioritize security hardening before or alongside the observability rollout.
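
As an illustration of how the public-versus-private counts could be derived, the sketch below summarises a list of resources. The input shape is an assumption: each resource is a dict with a `name`, a `publicNetworkAccess` value of `"Enabled"` or `"Disabled"`, and a `privateEndpointConnections` list. The field names in the real `network_posture` output may differ.

```python
def summarize_network_posture(resources: list[dict]) -> dict:
    """Count public-facing resources and private endpoint coverage."""
    def is_public(resource: dict) -> bool:
        return str(resource.get("publicNetworkAccess", "")).lower() == "enabled"

    def has_private_endpoint(resource: dict) -> bool:
        return bool(resource.get("privateEndpointConnections"))

    public = [r for r in resources if is_public(r)]
    return {
        "total": len(resources),
        "public_access_enabled": len(public),
        "with_private_endpoint": sum(1 for r in resources if has_private_endpoint(r)),
        # Public and no private endpoint: the most likely hardening candidates.
        "public_without_private_endpoint": [
            r["name"] for r in public if not has_private_endpoint(r)
        ],
    }


# Hypothetical resources for demonstration only.
summary = summarize_network_posture([
    {"name": "stcontoso", "publicNetworkAccess": "Enabled", "privateEndpointConnections": []},
    {"name": "kv-contoso", "publicNetworkAccess": "Disabled", "privateEndpointConnections": [{"id": "pe-1"}]},
    {"name": "sql-contoso", "publicNetworkAccess": "Enabled", "privateEndpointConnections": [{"id": "pe-2"}]},
])
print(summary)
```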
Recommended: Run `--discovery-only` mode first to get instant workbooks and visibility before committing to a full Terraform deployment.
If you prefer full control, you can manually create a .tfvars file and deploy step by step.
git clone <repo-url>
cd observability-solution-accelerator/infra
cp terraform.tfvars.example <your-company>.tfvars

Open `<your-company>.tfvars` and fill in the values below. Only 3 fields are required; everything else has working defaults.
| Setting | Type | Example | Description |
|---|---|---|---|
| `subscription_id` | string | `"a1b2c3d4-..."` | Your Azure subscription ID. Find it with `az account show --query id -o tsv` |
| `customer_name` | string | `"contoso"` | Lowercase alphanumeric + hyphens only. Used to name all resources (e.g., `la-contoso-obs`, `ai-contoso-webapp`) |
| `resource_group_name` | string | `"rg-contoso-obs"` | The resource group to deploy into. Must exist before running `terraform apply` |
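
The `customer_name` rule and the naming pattern seen in the examples (`rg-<customer>-obs`, `la-<customer>-obs`, `ai-<customer>-<app>`) can be checked before writing the `.tfvars` file. The helper below is a hypothetical pre-flight check and is not part of the repository.

```python
import re

# Lowercase letters, digits, and hyphens only, as the table above requires.
CUSTOMER_NAME_RE = re.compile(r"[a-z0-9-]+")


def derive_names(customer_name: str, apps: tuple[str, ...] = ()) -> dict:
    """Validate customer_name and return the resource names derived from it."""
    if not CUSTOMER_NAME_RE.fullmatch(customer_name):
        raise ValueError(
            f"customer_name {customer_name!r} must be lowercase alphanumeric plus hyphens"
        )
    return {
        "resource_group": f"rg-{customer_name}-obs",
        "workspace": f"la-{customer_name}-obs",
        "app_insights": [f"ai-{customer_name}-{app}" for app in apps],
    }


print(derive_names("contoso", ("webapp", "api")))
```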
Region & Workspace:
| Setting | Default | Description |
|---|---|---|
| `location` | `"westus2"` | Azure region. All resources deploy here. Alternatives: `swedencentral`, `eastus`, etc. |
| `workspace_sku` | `"PerGB2018"` | Log Analytics SKU. Use `"CapacityReservation"` for high-volume (>100 GB/day) |
| `workspace_retention_days` | `90` | Log retention in days (30–730). Set higher for compliance (e.g., 365) |
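
The two constraints in the table above (retention between 30 and 730 days, `CapacityReservation` for more than 100 GB/day) can be expressed as a small helper. This is a sketch: the function name and the `daily_ingest_gb` estimate are assumptions you would supply yourself.

```python
def suggest_workspace_settings(daily_ingest_gb: float, retention_days: int = 90) -> dict:
    """Pick a workspace SKU from expected ingestion and validate retention."""
    if not 30 <= retention_days <= 730:
        raise ValueError("workspace_retention_days must be between 30 and 730")
    # Above roughly 100 GB/day the README recommends a capacity reservation.
    sku = "CapacityReservation" if daily_ingest_gb > 100 else "PerGB2018"
    return {"workspace_sku": sku, "workspace_retention_days": retention_days}


print(suggest_workspace_settings(daily_ingest_gb=20))
print(suggest_workspace_settings(daily_ingest_gb=250, retention_days=365))
```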
Feature Toggles (turn modules on/off):
| Setting | Default | When to Enable |
|---|---|---|
| `enable_sentinel` | `true` | Disable only if you already have Sentinel elsewhere |
| `enable_network_observability` | `true` | Disable if you don't need NSG flow logs / Traffic Analytics |
| `enable_iot_hub` | `false` | Enable if you have OT/IoT devices (PLCs, SCADA, RTUs) |
| `enable_amba` | `false` | Enable for AMBA service-specific alert packs (SQL, AKS, Databricks, etc.) |
| `enable_aks` | `false` | Enable if you run AKS clusters (requires `aks_cluster_id`) |
| `enable_lighthouse` | `false` | Enable if managed by an MSP partner (requires tenant/principal IDs) |
Applications — App Insights (one instance per app):
| Setting | Default | Example |
|---|---|---|
| `app_insights_apps` | `[]` | `["webapp", "api", "worker"]` creates `ai-contoso-webapp`, `ai-contoso-api`, etc. |
Alerts & Notifications:
| Setting | Default | Description |
|---|---|---|
| `alert_email_recipients` | `[]` | Email addresses for alert notifications, e.g., `["noc@contoso.com"]` |
| `servicenow_webhook_uri` | `""` | ServiceNow webhook URL for ITSM integration (leave blank to skip) |
| `alert_thresholds` | See below | Tune core alert thresholds |
Default alert_thresholds (override any value you want):
alert_thresholds = {
  disk_free_percent         = 10   # Alert when disk < 10% free
  memory_committed_percent  = 90   # Alert when memory > 90%
  app_exception_count       = 50   # Alert when exceptions > 50 per 5min
  cpu_anomaly_score         = 2.0  # ML anomaly sensitivity (lower = more sensitive)
  heartbeat_missing_minutes = 5    # Alert when VM heartbeat lost for 5min
}

AMBA Service Alert Packs (requires enable_amba = true):
| Setting | Default | Description |
|---|---|---|
| `amba_services` | `["vm"]` | Services to monitor. Options: `vm`, `sql`, `appservice`, `aks`, `storage`, `keyvault`, `eventhub`, `cosmosdb`, `databricks`, `loadbalancer` |
| `amba_thresholds` | See below | Tune per-service alert thresholds |
Default amba_thresholds:
amba_thresholds = {
  vm_cpu_percent               = 85   # VM CPU warning
  vm_disk_iops                 = 500  # Disk IOPS saturation
  sql_dtu_percent              = 85   # SQL DTU/vCore utilization
  sql_failed_connections       = 10   # Failed SQL connections per 15min
  appservice_http_5xx_count    = 10   # HTTP 5xx errors per 15min
  appservice_response_time_sec = 5    # Average response time threshold
  aks_node_cpu_percent         = 80   # AKS node CPU warning
  aks_node_memory_percent      = 80   # AKS node memory warning
  storage_throttle_count       = 10   # Storage throttle events per 15min
  cosmosdb_ru_percent          = 80   # Cosmos DB RU consumption
}

AKS Monitoring (requires enable_aks = true):
| Setting | Default | Description |
|---|---|---|
| `aks_cluster_id` | `""` | Full resource ID of your AKS cluster, e.g., `/subscriptions/.../managedClusters/my-cluster` |
IoT Hub (requires enable_iot_hub = true):
| Setting | Default | Description |
|---|---|---|
| `iot_hub_sku` | `"S1"` | IoT Hub tier |
| `iot_hub_capacity` | `1` | Number of IoT Hub units |
Lighthouse / MSP (requires enable_lighthouse = true):
| Setting | Default | Description |
|---|---|---|
| `lighthouse_hub_tenant_id` | `""` | Managing partner's Microsoft Entra ID (formerly Azure AD) tenant ID |
| `lighthouse_hub_principal_id` | `""` | Service principal ID for delegated access |
Network (requires enable_network_observability = true):
| Setting | Default | Description |
|---|---|---|
| `nsg_ids` | `{}` | Map of NSG name → resource ID for flow logs, e.g., `{ "web-nsg" = "/subscriptions/.../nsgs/web-nsg" }` |
| `flow_log_retention_days` | `90` | Flow log retention in days |
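
If you already have a list of NSG resource IDs, the `nsg_ids` map can be built by taking the last path segment of each ID as the key. The helper below is a hypothetical sketch; the subscription GUID and resource group in the example are placeholders.

```python
def build_nsg_ids(resource_ids: list[str]) -> dict[str, str]:
    """Map NSG name -> full resource ID, rejecting non-NSG IDs and duplicate names."""
    nsg_ids: dict[str, str] = {}
    for resource_id in resource_ids:
        parts = resource_id.strip("/").split("/")
        # A valid NSG ID ends with .../networkSecurityGroups/<name>.
        if len(parts) < 2 or parts[-2].lower() != "networksecuritygroups":
            raise ValueError(f"not an NSG resource ID: {resource_id}")
        name = parts[-1]
        if name in nsg_ids:
            raise ValueError(f"duplicate NSG name: {name}")
        nsg_ids[name] = resource_id
    return nsg_ids


# Placeholder ID for illustration only.
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-net"
    "/providers/Microsoft.Network/networkSecurityGroups/web-nsg"
)
print(build_nsg_ids([example_id]))
```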
Tags:
| Setting | Default | Description |
|---|---|---|
| `tags` | `{}` | Tags applied to all resources. Example: `{ environment = "production", cost_center = "IT-OPS" }` |
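
The threshold maps above are described as overridable key by key. The sketch below illustrates that "defaults plus overrides" behaviour in Python, using the default `alert_thresholds` values from this README. It models the intended semantics only and is not how the Terraform modules implement it; `resolve_thresholds` is a hypothetical name.

```python
DEFAULT_ALERT_THRESHOLDS = {
    "disk_free_percent": 10,
    "memory_committed_percent": 90,
    "app_exception_count": 50,
    "cpu_anomaly_score": 2.0,
    "heartbeat_missing_minutes": 5,
}


def resolve_thresholds(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied, rejecting unknown keys."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        # A typo in a key would otherwise be ignored silently.
        raise KeyError(f"unknown threshold(s): {', '.join(unknown)}")
    return {**defaults, **overrides}


effective = resolve_thresholds(DEFAULT_ALERT_THRESHOLDS, {"disk_free_percent": 15})
print(effective)
```

Rejecting unknown keys is a design choice of this sketch: it surfaces misspelled threshold names early instead of leaving the default in force unnoticed.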
# Login to Azure
az login
# Create the resource group (if it doesn't exist)
az group create --name rg-contoso-obs --location westus2
# Initialize Terraform (downloads providers + modules)
cd infra/
terraform init
# Preview what will be created (no resources deployed yet)
terraform plan -var-file=contoso.tfvars
# Deploy the full observability stack
terraform apply -var-file=contoso.tfvars

# Run the smoke test to validate deployment
cd ..
chmod +x tests/smoke-test.sh
./tests/smoke-test.sh rg-contoso-obs contoso --sentinel

With just the 3 required values + defaults, you get:
| Resource | Count | Description |
|---|---|---|
| Log Analytics Workspace | 1 | Central log store with 90-day retention |
| Data Collection Rules | 2 | Windows + Linux performance counters |
| Microsoft Sentinel | 1 | SIEM onboarded to the workspace |
| Action Groups | 3 | Critical, Warning, Automation notification channels |
| Alert Rules | 5 | Heartbeat loss, app errors, disk, memory, CPU anomaly |
| Azure Policy Assignments | 6 | Auto-deploy AMA + associate DCRs to all VMs |
| Network Watcher | 1 | Network observability baseline |
| Azure Workbooks | up to 11 | Auto-generated and deployed from discovery (VM, Network, App, K8s, IoT, Storage, Key Vault, SQL, Cosmos DB, Logic App, infra health, cost, security, performance) |
| Task | How | When |
|---|---|---|
| Deploy Logic Apps | Use ARM templates from `automation/logic-apps/` | If using ServiceNow |
| Register Runbooks | Import `.ps1` from `automation/runbooks/` into Azure Automation | For L0 auto-remediation |
| Instrument Apps | Follow guides in `docs/instrumentation/` (Java, .NET, Node.js, Python) | When onboarding apps |
- Minimal (IT-only): See terraform.tfvars.example
- Energy/Industrial (full OT stack): See energy-customer.tfvars
- Alert guidance: See docs/alert-recommendations.md
terraform destroy -var-file=contoso.tfvars
az group delete --name rg-contoso-obs --yes

Step-by-step process to onboard a new customer onto the observability platform.
Run the accelerator in discovery-only mode to scan the customer's environment and deploy instant dashboards:
# Single subscription
python accelerator.py \
  --subscription-id "customer-sub-id" \
  --customer-name "acme" \
  --discovery-only

# Or scan all subscriptions in their tenant
python accelerator.py \
  --tenant-scan \
  --customer-name "acme" \
  --discovery-only

Output: Resource inventory JSON (`discovery-output.json`), network posture report, up to 11 workbooks auto-deployed.
- Open the generated workbooks in Azure Portal → Monitor → Workbooks
- Review network posture — prioritize resources with public access enabled
- Check `discovery-output.json` for resource categories found and feature flags detected
- Share the workbook screenshots with the customer as immediate value
Once the customer approves, deploy the full stack:
python accelerator.py \
  --subscription-id "customer-sub-id" \
  --customer-name "acme" \
  --resource-group "rg-acme-obs" \
  --location "westus2" \
  --auto-approve

This adds: Sentinel, alert rules, action groups, policies (AMA auto-enrollment), data collection rules, and network observability.
| Integration | How |
|---|---|
| ServiceNow ITSM | Deploy Logic Apps from automation/logic-apps/ → configure webhook URI |
| Auto-Remediation | Import runbooks from automation/runbooks/ → link to action groups |
| Application Monitoring | Instrument apps using guides in docs/instrumentation/ |
| Lighthouse (if MSP) | Enable with enable_lighthouse = true for cross-tenant management |
- Run the smoke test: `./tests/smoke-test.sh rg-acme-obs acme --sentinel`
- Inject test data: `python tests/inject-fake-data.py`
- Review alert noise and tune thresholds in `.tfvars`
- Review MTTI/MTTR workbook for baseline metrics
See docs/onboarding-playbook.md and docs/customer-onboarding.md for detailed guidance.
| Phase | Scope | Duration |
|---|---|---|
| Phase 0 | Build accelerator assets (IaC, dashboards, playbook) — done once | Weeks 1-3 |
| Phase 1 | Discovery & Assessment per customer | Weeks 1-4 |
| Phase 2 | Foundation Deployment (agents, workspaces, pipelines) | Weeks 5-8 |
| Phase 3 | Intelligence & Automation (AI alerts, predictive, ServiceNow) | Weeks 9-12 |
| Phase 4 | Optimization & AI Enablement (tune, consolidate, ROI) | Weeks 13-16 |
See docs/onboarding-playbook.md for the full step-by-step guide.
| KPI | Target |
|---|---|
| MTTI reduction | ≥40% |
| MTTR reduction | ≥40% |
| Auto-remediated incidents | ≥30% |
| Monitoring coverage | 100% of resources |
| Alert noise reduction | ≥50% |
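
To check progress against the reduction targets above, compare a baseline period with the current period. The sketch below assumes you have per-incident durations in minutes (time to identify or time to resolve) exported from your ITSM tool; the function names and sample numbers are illustrative, not measured results.

```python
from statistics import mean


def percent_reduction(baseline: float, current: float) -> float:
    """Percentage by which `current` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline * 100


def meets_target(baseline_minutes: list[float], current_minutes: list[float],
                 target_percent: float = 40.0) -> bool:
    """True if the mean duration dropped by at least target_percent."""
    reduction = percent_reduction(mean(baseline_minutes), mean(current_minutes))
    return reduction >= target_percent


# Made-up durations: baseline mean is 60 minutes, current mean is 33 minutes.
print(meets_target([50, 70], [30, 36]))
```

The same helper works for MTTI and MTTR. For the alert noise and auto-remediation KPIs, pass counts or ratios to `percent_reduction` directly.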
| Component | Service | Role |
|---|---|---|
| Hybrid projection | Azure Arc | Manage on-prem as Azure resources |
| Agent | Azure Monitor Agent (AMA) | Unified data collection via DCRs |
| Logs & Metrics | Log Analytics Workspace | Central query & storage engine |
| APM | Application Insights (workspace-based) | Distributed tracing, app performance |
| Instrumentation | OpenTelemetry + Azure Monitor Exporter | Vendor-neutral app telemetry |
| Security | Microsoft Sentinel | SIEM, SOAR, UEBA |
| OT/IoT | Azure IoT Hub + IoT Edge | Device management, edge compute |
| OT Security | Defender for IoT | Agentless OT network monitoring |
| OT Analytics | Microsoft Fabric (KQL DB + Lakehouse) | Real-time + historical OT analytics |
| Proactive Signals | Fabric Data Activator | Pattern-based triggers (no static alerts) |
| Predictive | Azure ML | MTTF prediction for IoT devices |
| Alerts | Azure Monitor Alert Rules | Dynamic thresholds, log-based, metric |
| Automation | Azure Automation Runbooks | L0/L1 self-healing |
| Orchestration | Logic Apps | ServiceNow ↔ Azure bidirectional workflows |
| ITSM | ServiceNow (ITSMC) | Incident mgmt, CMDB, change mgmt |
| IaC | Terraform | Repeatable deployment |
| Governance | Azure Policy | Enforce monitoring-by-default |
| Multi-tenant | Azure Lighthouse | Cross-customer management |
| Visualization | Azure Workbooks + Managed Grafana | Dashboards & reports |
| AI | Azure Copilot for Operations | NL queries over telemetry |
| Network | Network Watcher + NSG Flow Logs | Network observability |
| Person | Role |
|---|---|
| Andrew Delosky | Lead / Account Strategy |
| Gaurav Bhardwaj | Emerging Tech / Accelerator Build |
| Wilkin Shum | Technical Contributor |
| Pat Lowe | ATU Industry SME |
| Yong/Hong | OT Observability (ATU) |
| Paul | Prior Observability POC |
Mondays 2:30 PM ET — recurring sync