Skip to content

Latest commit

 

History

History
669 lines (490 loc) · 20.4 KB

File metadata and controls

669 lines (490 loc) · 20.4 KB

Troubleshooting

Audience: anyone hitting an error and looking for the fix. Junior or senior; first-week or veteran. Each entry tells you (a) what the error looks like, (b) why it happens, (c) the exact fix.

Index:

For known issues with no available fix yet, see docs/reference/gap-assessment.md. For full audit-chain repair, see docs/operations/audit-recovery.md.


Authentication errors

AADSTS7000215: Invalid client secret

Looks like:

EnvironmentCredential: Authentication failed: AADSTS7000215:
Invalid client secret provided. Ensure the secret being sent in
the request is the client secret value, not the client secret ID,
for a secret added to app '<guid>'.

Why: the most common Entra ID papercut. In the Azure Portal under App Registrations → your app → Certificates & secrets, the table shows two columns that both look copyable:

Column What it is Goes in .env?
Value The actual secret (a random ~40-char string) ✅ Yes — this is AZURE_CLIENT_SECRET
Secret ID A GUID identifying the secret entry ❌ No

The Value is shown once at creation. If you clicked away, it's gone — you have to create a new client secret.

Fix:

  1. Azure Portal → App Registrations → your app → Certificates & secrets.
  2. New client secret → name + expiry.
  3. Immediately copy the Value column (not Secret ID).
  4. Paste into .env as AZURE_CLIENT_SECRET=<value>.
  5. Re-run contentops doctor --matrix.

Verify it loaded:

$s = $env:AZURE_CLIENT_SECRET
"len=$($s.Length), prefix=$($s.Substring(0,3))..."

A real Value is ~40 characters. A Secret ID is exactly 36 (GUID xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) — if your length is 36 that's the smoking gun.

Don't paste the actual secret in chat / issue trackers / PRs.

AADSTS700213: No matching federated identity record found

Looks like (in CI logs):

AADSTS700213: No matching federated identity record found for
presented assertion subject 'repo:ORG/REPO:environment:production'.

Why: the App Registration's federated credential subject doesn't match the GitHub OIDC token's subject. GitHub mints the token with subject repo:<org>/<repo>:environment:<env-name> (or ...:ref:refs/heads/main, or ...:pull_request); your App Reg must have a federated credential entry that exactly matches.

Fix:

  1. Azure Portal → App Registrations → your app → Certificates & secrets → Federated credentials.
  2. Add credential → "GitHub Actions deploying Azure resources".
  3. Fill in: org, repo, "Environment" (or "Branch" / "Pull request" depending on which workflow triggered the failure).
  4. Save. The subject is auto-derived from your inputs — it should match the error message verbatim.
  5. Re-run the workflow.

Common pitfalls: typos in org/repo, wrong environment name (case matters), missing entries for both environment:production AND ref:refs/heads/main (workflows that fire on both triggers need both subjects).

401 on workspace_reachable — token rejected

Looks like (in contentops doctor --matrix):

[FAIL] workspace_reachable — GET alertRules returned 401 —
token rejected as unauthenticated. Check `az account show` tenant
context matches tenant.yml.

Why: the token was acquired successfully but ARM/LA rejected it. Either (a) the identity is from a different Entra tenant than the workspace, or (b) DefaultAzureCredential returned a stale cached identity (VSCode account, SharedTokenCache) instead of the one you authenticated with.

Fix:

  1. Check the tenant context:
    az account show --query tenantId
    # compare against config/tenant.yml `tenantId:` field
  2. If they match but the error persists, force the dev-credential chain:
    $env:AZURE_TOKEN_CREDENTIALS = "dev"
    contentops doctor --matrix
  3. If still failing, az logout and az login --tenant <id> explicitly.

403 on workspace_reachable — RBAC missing

Looks like:

[FAIL] workspace_reachable — GET alertRules returned 403 —
authenticated but lacks RBAC on this workspace.

Why: the identity is reaching the workspace correctly but doesn't have permission to read alert rules.

Fix: grant Microsoft Sentinel Contributor on the workspace's resource group to whichever identity is active:

  • Path A (az login as user): grant to your user account.
  • Path B (.env with App Reg secret): grant to the App Reg's service principal.
# Path B example
az role assignment create `
  --role "Microsoft Sentinel Contributor" `
  --assignee "<App-Reg-objectId-or-clientId>" `
  --scope "/subscriptions/<sub>/resourceGroups/<rg>"

Wait 1–2 minutes for RBAC to propagate, then re-run doctor.

403 on Graph (graph_reachable / Defender handler)

Looks like:

GET /security/rules/detectionRules returned 403

Why: the App Registration lacks the CustomDetection.Read.All (or ReadWrite.All) Graph application permission, or admin consent hasn't been granted.

Fix:

  1. Azure Portal → App Registrations → your app → API permissions.
  2. Add a permission → Microsoft Graph → Application permissions → search "CustomDetection" → check CustomDetection.ReadWrite.All.
  3. Click Grant admin consent for .
  4. Wait 1–2 minutes; re-run doctor.

For the navigator / auto-disabled-rules paths that also hit the Log Analytics Query API, Microsoft Sentinel Reader (or Contributor) on the workspace is enough — those use a different token audience (api.loganalytics.io), not Graph.


Configuration errors

Tenant has 2 Sentinel workspaces; specify --role or --workspace

Looks like:

error: Tenant has 2 Sentinel workspaces; specify --role or
--workspace. Available: [law-sentinel (prod), SIT-Workspace
(integration)]

Why: your config/tenant.yml lists multiple workspaces; the command needs to know which one to target.

Fix: pass --role or --workspace:

contentops apply --role prod
contentops apply --workspace SIT-Workspace

For commands that should iterate every workspace of a role (e.g. deploy.yml), the workflow already passes --role prod — single-workspace local runs are the only place you usually hit this.

detections_dir — not a directory: detections

Looks like:

[FAIL] detections_dir — not a directory: detections
[FAIL] detections_parse — detections/ missing

Why: the doctor check is cwd-relative. You're running from somewhere other than the repo root (commonly the config/ subdirectory).

Fix:

cd <repo-root>     # e.g. cd C:\git\SIEMContent
contentops doctor --matrix

Both FAILs become PASSes.

config/detections/ appeared after I ran collect

Why: you ran contentops collect --path detections from config/. The --path argument is cwd-relative, so collect created config/detections/. The real corpus is at the repo root, not under config/.

Fix:

cd <repo-root>
Remove-Item -Recurse -Force .\config\detections\
contentops collect --role prod              # uses default --path detections

Run all CLI commands from the repo root. The only files inside config/ should be tenant.yml, lifecycle.yml, kql_lint_allowlist.yml, etc. — no detections/.

gitignored config/tenant.yml doesn't exist after fresh clone

Why: config/tenant.yml is gitignored on purpose; it carries your tenant identifiers.

Fix: copy the example and edit:

Copy-Item config\tenant.yml.example config\tenant.yml
# Edit: subscriptionId, resourceGroup, workspaceName, tenantId

See docs/operations/tenant-config-modes.md for the supported modes (single-workspace, multi-workspace, Defender-only).


Lint + validate errors

600+ META002–005 errors on a fresh tenant.yml

Looks like:

Lint summary: 152 files scanned, 152 with findings, 1176 finding(s) total.
META rules in strict mode (tenant.policy.scaffoldStrict=true).
608 finding(s) at-or-above severity 'error'.

Why: you set policy.scaffoldStrict: true (explicitly or through an older example file). META002–005 — the authoring fields description / attackDescription / references / falsePositives — are CI-blocking in strict mode. The G24 authoring backlog (51 production rules without these fields) hasn't been written yet.

Fix (default, lenient): since PR #241 the default is lenient. Set the policy block to false (or remove it entirely):

# config/tenant.yml
policy:
  scaffoldStrict: false   # or omit the policy block; both behave the same

Re-run: META002–005 become warnings, exit 0.

Fix (you actually want strict): keep scaffoldStrict: true and fill in the missing fields. Each envelope needs four paragraphs of content (description, attackDescription, references[], falsePositives[]); see docs/reference/envelope-schema.md for the schema.

version-bump check failed

Looks like (in CI):

detections/sentinel_analytic/my-rule.yml:
  before: 0.1.0
  after:  0.1.0
  diff:   present
error: file changed without a version bump

Why: the scripts/check_version_bump.py CI gate refuses any PR that changes an envelope's content without bumping the version: field — silent overwrites would defeat the audit trail.

Fix: bump the version in your envelope:

# detections/sentinel_analytic/my-rule.yml
id: my-rule
version: 0.1.1   # was 0.1.0

Whitespace-only diffs still trip the check — that's intentional (if the diff doesn't matter, revert the cosmetic edit; otherwise bump).

catalog drift: docs/reference/generated-catalog.md disagrees with the regenerated output

Looks like (in CI):

catalog drift: committed docs/reference/generated-catalog.md
disagrees with the regenerated output.

Why: you added a new CLI command, lint rule, workflow, or handler without re-running contentops catalog regenerate. The generated catalog is drift-gated to prevent stale references in docs.

Fix:

contentops catalog regenerate
git add docs/reference/generated-catalog.md
git commit --amend --no-edit --signoff   # or new commit
git push --force-with-lease

detection-docs drift: 156 file(s) under docs/detections/ disagree

Looks like:

docs/detections/sentinel_analytic/my-rule.md is out of sync with
the envelope.

Why: you changed an envelope without re-running contentops detection-docs regenerate. Same drift-gate pattern as the catalog above.

Fix:

contentops detection-docs regenerate
git add docs/detections/
git commit -m "chore(docs): regenerate detection docs" --signoff
git push

New CLI commands found without an e2e capability entry

Looks like (in CI pytest output):

FAILED tests/e2e/test_capability_drift_guard.py::test_registry_matches_click_tree
  New CLI commands found without an e2e capability entry:
    my-new-command

Why: you added a Click command but didn't register it in the e2e capability matrix at tests/e2e/_capabilities.py. The matrix exists to keep new commands from silently bypassing the live integration test suite.

Fix: add the command to INTENTIONALLY_UNCOVERED (with a justification referencing where it's tested instead) OR add a Capability(...) entry to CAPABILITIES and append the path to COVERED_LEAVES. See existing entries in the file for the shape.


Apply errors

apply returned 400: Failed to run the analytics rule query. One of the tables does not exist.

Looks like:

ERROR contentops.handlers.sentinel_analytic: Failed to deploy
sentinel rule aa-azure-vm-...: 400 {"error":{"code":"BadRequest",
"message":"Failed to run the analytics rule query. One of the
tables does not exist."}}

Why: the rule's KQL references a table that's available in one workspace (e.g. prod) but not in another (e.g. integration). Common when a data connector is enabled in prod but not in integration.

Fix (three options):

  1. Enable the connector on the workspace that's missing it (e.g. AzureActivity, SecurityEvent). Permanent fix.
  2. Mark the rule status: experimental locally so integration skips it (env-status gate). Prod keeps deploying it.
  3. Leave it. The rule's existing tenant-side state isn't touched by the failed update — drift will report it as in-sync against the OLD payload. Acceptable for known-bad single failures; track in your runbook.

verified=False / MISMATCH on a Defender rule

Looks like:

my-detection  defender_custom_detection  update  success  MISMATCH

Why: apply succeeded (200 OK on PUT) but the round-trip hash check failed. The local envelope and the immediate post-PUT GET produce different content hashes — usually because Defender's server adds or normalises a field (timestamps, computed IDs).

Fix — diagnose first:

contentops defender-roundtrip-diff <envelope_id>

This shows you exactly which fields differ. If the difference is a server-managed field that should be stripped, file an issue against contentops/handlers/defender_custom_detection.py's _SERVER_FIELDS set. The --raw flag skips the strip step so you can see new server-managed fields the codebase doesn't know about yet.

1 error(s) in apply summary — what's the actual error?

Looks like:

[summary table — every row PASS except one]
1 error(s).

Why: the apply summary at the end is per-row; the actual error text was printed inline during the apply loop. Scroll up in the output to find the ERROR contentops.handlers... line for the failed row.

Fix: if you missed it, the audit log has it:

contentops audit query failures --since 1h

Drift + collect errors

prune --dry-run finds 0 orphans but I expected some

Why: prune only flags rules in the tenant that don't appear in your local YAML. If your local detections/ already matches the tenant (e.g. you just ran collect), there's nothing to prune.

If you wanted to clear the tenant of rules NOT in detections/, that's what prune does. If you wanted the opposite — clear local YAMLs that aren't in the tenant — that's contentops clean or manual rm.

collect --role X says "all in-sync" but the portal shows newer rules

Why: collect returns "in-sync" when the local envelope's hash matches the server's. If a portal user just edited a rule and your local copy was already correct (matching the post-edit state), no change shows.

Fix to force a refresh:

contentops clean --asset sentinel_analytic --yes
contentops collect --role prod --asset sentinel_analytic --full

--clear does both in one step:

contentops collect --role prod --clear

CI gate failures

codespell failed on a domain term

Looks like:

./detections/sentinel_analytic/my-rule.yml:42: tehcnique ==> technique

Why: typo in author-controlled prose. (If it's a real domain term, see below.)

Fix (real typo): fix the spelling in the YAML and recommit.

Fix (false positive — legitimate domain term): add it to ignore-words-list in .codespellrc:

ignore-words-list = iif,te,ans,fpr,...,your-new-term-here

references-check reports broken URLs

Looks like:

broken: 2 of 47 URL(s)
  https://blog.example.com/old-post
    HTTP 404
    in detections/sentinel_analytic/my-rule.yml

Why: a URL in metadata.references[] or metadata.runbookUrl returned 4xx/5xx. The PR-time check (validate.yml) flags only URLs newly added in the diff; the weekly full scan (references-check.yml) catches URL rot over time.

Fix: update or remove the broken URL. If it's a transient CDN issue (e.g. login.microsoftonline.com sometimes redirects oddly), add the substring to the workflow's --allow list.


Fork PR limitations

If you opened a PR from a fork (not a branch in KustoKing/SIEMContent itself), some CI checks will skip or render degraded output. This is intentional — GitHub doesn't mint OIDC tokens for fork PRs, so we can't trust them with tenant credentials.

Check Behaviour on fork PR Why
drift-pr (informational drift comment) Skipped OIDC unavailable; cannot query the tenant.
tuning-impact-preview Posts comment with - for counts OIDC unavailable; --no-workspace-query mode used.
Pre-PR schema refresh in validate.yml Falls through to committed baseline continue-on-error: true; lint still runs against the existing schema.
plan --against-tenant overlay Not exercised in CI on fork PRs Same reason.

What still works on fork PRs:

  • All YAML / Python / metadata lint
  • Pytest, SAST (bandit + semgrep), DCO, SPDX
  • Spelling check, references URL check (outbound HTTP only)
  • Structural validate.yml plan (no API calls)

If you need full signal: push your branch into the base repo (any maintainer can help with this), reopen the PR there.


Workspace + tenant data errors

SentinelHealth returns 0 rows even though rules are firing

Why: SentinelHealth is an opt-in diagnostic data collection on the workspace (opt-in since approximately 2022). If it's not turned on, the table simply has no rows, even though your alert rules are running normally.

Verify the diagnostic:

contentops doctor --matrix
# Look for [WARN] sentinel_health — SentinelHealth returned 0 rows...

Fix: enable the diagnostic per https://learn.microsoft.com/en-us/azure/sentinel/health-audit. Azure Portal → Sentinel → Settings → Diagnostic settings → enable "SentinelHealth". Wait 15–30 minutes for the first rows to appear.

Once enabled, contentops auto-disabled-rules will surface real data instead of empty results.

Defender custom detection skipped on integration apply

Looks like:

env-status filter (gate=integration): 46 asset(s) skipped (allowed:
['deprecated', 'production', 'test'])
  - detection-of-attempts-to-disable-microsoft-defender (status=production [defender:prod-only])
  ...

Why: Defender XDR is tenant-wide — there's no "integration" instance of it like there is for Sentinel. The apply gate marks every defender_* envelope as defender:prod-only so they only deploy on --role prod runs. By design, not a bug.

Fix: none needed. To deploy a Defender custom detection, use contentops apply --role prod (or merge to main and let deploy.yml run).

Integration deploy succeeded but rules don't appear in the portal

Why: Sentinel rules can take 30–60 seconds to surface in the portal after a successful PUT (caching layer). The audit record will say success immediately; the portal UI lags.

Fix: wait, then refresh the portal. If they still don't appear after 5 minutes:

contentops drift --role integration
# Look for 'new' entries — that's what apply created.
# If drift shows 0 new + 0 changed but the portal is empty,
# you've hit a tenant-side caching issue. Try the Azure Sentinel
# REST API directly:
az rest --method GET --url "https://management.azure.com/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<ws>/providers/Microsoft.SecurityInsights/alertRules?api-version=2025-07-01-preview"

Still stuck?

If your error isn't above, open an issue with: command run, full output (redact secrets!), git rev-parse HEAD, and contentops doctor --matrix output.