Doctor: causal timeline & cross-signal correlation (the headline differentiator) #26

Description

@btwshivam

Problem

Today's kerno doctor output is a flat list of findings. That's good — but every competing tool (Datadog, Pixie, Cilium Hubble, Inspektor Gadget) also produces a flat list of findings. The differentiator promised in the brand positioning is:

"Datadog shows the dashboard. Kerno tells you the diagnosis."

A diagnosis is causal. It looks like:

[14:32:01.000] Disk fsync p99 spiked to 240ms (was 8ms)
       │
       ▼
[14:32:03.500] PostgreSQL write latency p99 climbed to 1.8s (write_xlog blocked on fsync)
       │
       ▼
[14:32:05.200] payment-api request p99 hit 6s; first 504s emitted
       │
       ▼
[14:32:06.100] kube-proxy retransmit rate hit 4% (downstream queue saturation)

Root cause: storage controller (sda) experienced an unusual fsync stall.
Recommended action: investigate /sys/block/sda for IO errors; consider
moving WAL to a less contended device.

Every signal here is already in Signals; today we just don't sequence them.

Goal

Build the causal timeline engine: take the []Finding plus the time-bucketed signal history (the engine.history ring buffer, currently 10 entries) and produce a Timeline that orders findings by their first-detection timestamp and links them by suspected causation.
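As a rough illustration of the shapes this implies, here is a minimal sketch. The `Event`, `Link`, and `Timeline` types, the `NewTimeline` constructor, and the finding IDs are hypothetical names chosen for the example, not the existing kerno types, and the date in `clock` is arbitrary.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Event is a hypothetical timeline entry: a finding plus its first-detection time.
type Event struct {
	FindingID string
	FiredAt   time.Time
}

// Link is a hypothetical suspected-causation edge between two findings.
type Link struct {
	Cause      string
	Effect     string
	Confidence float64
}

// Timeline holds events ordered by first detection, plus the links between them.
type Timeline struct {
	Events []Event
	Links  []Link
}

// NewTimeline copies the events and sorts the copy by FiredAt.
// The sort is stable, so events with equal timestamps keep their input order.
func NewTimeline(events []Event, links []Link) Timeline {
	sorted := append([]Event(nil), events...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].FiredAt.Before(sorted[j].FiredAt)
	})
	return Timeline{Events: sorted, Links: links}
}

// clock builds a timestamp on an arbitrary example day.
func clock(h, m, s int) time.Time {
	return time.Date(2024, time.January, 1, h, m, s, 0, time.UTC)
}

func main() {
	tl := NewTimeline([]Event{
		{FindingID: "payment-api-latency", FiredAt: clock(14, 32, 5)},
		{FindingID: "disk-fsync", FiredAt: clock(14, 32, 1)},
		{FindingID: "postgres-write-latency", FiredAt: clock(14, 32, 3)},
	}, []Link{
		{Cause: "disk-fsync", Effect: "postgres-write-latency", Confidence: 0.9},
	})
	for _, e := range tl.Events {
		fmt.Println(e.FiredAt.Format("15:04:05"), e.FindingID)
	}
}
```

Copying before sorting keeps the caller's []Finding-derived slice untouched, which may matter if the flat list is still rendered alongside the timeline.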

Files to add

  • internal/doctor/timeline.go — the engine
  • internal/doctor/timeline_test.go — table-driven tests with synthetic Signals showing known cascades
  • internal/doctor/render_timeline.go — renderer extension for the box-tree view above
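The renderer can stay small. The following is a sketch of one way to produce the box-tree layout from the Problem section, assuming a hypothetical `Entry` type and `RenderTimeline` function; the real renderer would presumably plug into the existing doctor output code instead.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Entry is a hypothetical renderable timeline row.
type Entry struct {
	FiredAt time.Time
	Summary string
}

// connector is drawn between consecutive entries.
const connector = "       │\n       ▼\n"

// RenderTimeline writes one "[HH:MM:SS.mmm] summary" line per entry,
// joined by the vertical connector. An empty slice renders as "".
func RenderTimeline(entries []Entry) string {
	var b strings.Builder
	for i, e := range entries {
		if i > 0 {
			b.WriteString(connector)
		}
		fmt.Fprintf(&b, "[%s] %s\n", e.FiredAt.Format("15:04:05.000"), e.Summary)
	}
	return b.String()
}

// at builds a timestamp on an arbitrary example day, with millisecond precision.
func at(h, m, s, ms int) time.Time {
	return time.Date(2024, time.January, 1, h, m, s, ms*int(time.Millisecond), time.UTC)
}

func main() {
	fmt.Print(RenderTimeline([]Entry{
		{FiredAt: at(14, 32, 1, 0), Summary: "Disk fsync p99 spiked to 240ms (was 8ms)"},
		{FiredAt: at(14, 32, 3, 500), Summary: "PostgreSQL write latency p99 climbed to 1.8s"},
	}))
}
```

Returning a string rather than writing to stdout keeps the function easy to cover with table-driven tests.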

Approach

  1. Signal history: bump engine.maxHistory from 10 to ~120 (10 minutes at 5s intervals — bounded, ~1MB max).
  2. Timeline construction: for each finding, find the earliest Signals snapshot in history where the underlying metric crossed its threshold. That's the "fired at" timestamp.
  3. Causation links: a small rule table. One row is disk-saturation → DB-slow → upstream-API → cascade; another is memory-pressure-imminent → OOMKill. Match by signal pair plus temporal ordering (the cause must precede the effect).
  4. Confidence: each link has 0–1 confidence; only links above 0.7 render in the timeline.

Acceptance criteria

  • kerno doctor --timeline renders the box-tree view
  • kerno doctor --output json includes a timeline array with each link annotated by {cause, effect, confidence, gap_ms}
  • Pairs with the chaos cascade in scripts/verify.sh: induce the disk→TCP→memory→CPU cascade and assert that the timeline links the findings in that order
  • Unit tests cover: linear cascade, parallel unrelated findings, single-finding case, no-findings case
  • AI integration: when --ai is set, the AI summary references the timeline by ID, not by raw finding name
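For the JSON criterion, the link annotation could be encoded along these lines. The field names `cause`, `effect`, `confidence`, and `gap_ms` and the `timeline` key come from the criteria above; the `Report` wrapper, the `EncodeReport` helper, and the example values are assumptions, and the real report would carry other fields (such as the findings) as well.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Link carries the four annotations named in the acceptance criteria.
type Link struct {
	Cause      string  `json:"cause"`
	Effect     string  `json:"effect"`
	Confidence float64 `json:"confidence"`
	GapMS      int64   `json:"gap_ms"`
}

// Report is a hypothetical, trimmed-down JSON envelope with only the timeline.
type Report struct {
	Timeline []Link `json:"timeline"`
}

// EncodeReport marshals the links under a "timeline" key. A nil slice is
// replaced with an empty one so the no-findings case emits [] rather than null.
func EncodeReport(links []Link) (string, error) {
	if links == nil {
		links = []Link{}
	}
	b, err := json.Marshal(Report{Timeline: links})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := EncodeReport([]Link{
		{Cause: "disk-saturation", Effect: "db-slow", Confidence: 0.9, GapMS: 2500},
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```

Emitting an empty array for the no-findings case keeps consumers from having to special-case null when they iterate over the timeline.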

Why this matters

This is the shareable feature. A screenshot of a 4-hop causal timeline going viral on r/sre is worth 10x the OOMKill detection. It's also the foundation for Phase 18.1 (shareable incident links) and the kubectl-kerno demo.

Effort

~5 days. Significant: history store, link engine, renderer, AI prompt update, tests.

References

  • Phase 14.2 in TODO.md (the "headline feature")
  • Brand positioning: BRAND_TODO.md (gitignored, in repo) — "Datadog shows, Kerno tells"

Metadata

Labels

  • P0: Highest priority (ship-blocker or major differentiator)
  • area/doctor: Diagnostic engine and rules
  • claimed: Someone is actively working on this (auto-released after 10d inactivity)
  • enhancement: New feature or request
