Problem
Today's kerno doctor output is a flat list of findings. That's good — but every competing tool (Datadog, Pixie, Cilium Hubble, Inspektor Gadget) also produces a flat list of findings. The differentiator promised in the brand positioning is:
"Datadog shows the dashboard. Kerno tells you the diagnosis."
A diagnosis is causal. It looks like:
[14:32:01.000] Disk fsync p99 spiked to 240ms (was 8ms)
│
▼
[14:32:03.500] PostgreSQL write latency p99 climbed to 1.8s (write_xlog blocked on fsync)
│
▼
[14:32:05.200] payment-api request p99 hit 6s; first 504s emitted
│
▼
[14:32:06.100] kube-proxy retransmit rate hit 4% (downstream queue saturation)
Root cause: storage controller (sda) experienced an unusual fsync stall.
Recommended action: investigate /sys/block/sda for IO errors; consider
moving WAL to a less contended device.
Every signal here is already in Signals; today we just don't sequence them.
Goal
Build the causal timeline engine: take the []Finding plus the time-bucketed signal history (the engine.history ring buffer, currently 10 entries) and produce a Timeline that orders findings by their first-detection timestamp and links them by suspected causation.
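One possible shape for the Timeline, as a rough sketch rather than a settled design. The Finding fields, the Event wrapper, and the NewTimeline helper are hypothetical names for illustration. The link fields mirror the {cause, effect, confidence, gap_ms} annotation named in the acceptance criteria, and the sample timestamps are taken from the example cascade above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// Finding stands in for the existing doctor finding type.
// Field names here are assumptions, not the real struct.
type Finding struct {
	ID      string `json:"id"`
	Summary string `json:"summary"`
}

// Event places a finding on the timeline at its first-detection time.
type Event struct {
	Finding Finding   `json:"finding"`
	FiredAt time.Time `json:"fired_at"`
}

// Link is a suspected causal edge between two findings, keyed by finding ID.
type Link struct {
	Cause      string  `json:"cause"`
	Effect     string  `json:"effect"`
	Confidence float64 `json:"confidence"`
	GapMs      int64   `json:"gap_ms"`
}

// Timeline is the ordered event list plus the links between events.
type Timeline struct {
	Events []Event `json:"events"`
	Links  []Link  `json:"links"`
}

// NewTimeline copies the events and sorts the copy by FiredAt.
// The sort is stable, so events with equal timestamps keep input order.
func NewTimeline(events []Event, links []Link) Timeline {
	sorted := append([]Event(nil), events...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].FiredAt.Before(sorted[j].FiredAt)
	})
	return Timeline{Events: sorted, Links: links}
}

// exampleTimeline builds a three-event cascade, deliberately out of order.
func exampleTimeline() Timeline {
	base := time.Date(2024, 1, 1, 14, 32, 0, 0, time.UTC)
	events := []Event{
		{Finding: Finding{ID: "api-latency", Summary: "payment-api request p99 hit 6s"}, FiredAt: base.Add(5200 * time.Millisecond)},
		{Finding: Finding{ID: "disk-fsync", Summary: "Disk fsync p99 spiked to 240ms"}, FiredAt: base.Add(1000 * time.Millisecond)},
		{Finding: Finding{ID: "db-write", Summary: "PostgreSQL write latency p99 climbed to 1.8s"}, FiredAt: base.Add(3500 * time.Millisecond)},
	}
	links := []Link{
		{Cause: "disk-fsync", Effect: "db-write", Confidence: 0.9, GapMs: 2500},
		{Cause: "db-write", Effect: "api-latency", Confidence: 0.8, GapMs: 1700},
	}
	return NewTimeline(events, links)
}

func main() {
	out, err := json.MarshalIndent(exampleTimeline(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keying links by finding ID rather than by slice index keeps them valid after the events are re-sorted.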
Files to add
internal/doctor/timeline.go — the engine
internal/doctor/timeline_test.go — table-driven tests with synthetic Signals showing known cascades
internal/doctor/render_timeline.go — renderer extension for the box-tree view above
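For render_timeline.go, the box-tree view from the Problem section could start as small as the sketch below. Event and renderTimeline are hypothetical names, and the function assumes the events are already sorted by first-detection time. The root cause and recommended action lines are left out.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Event is a minimal stand-in: one timeline entry with its fired-at time.
type Event struct {
	FiredAt time.Time
	Summary string
}

// renderTimeline writes one line per event and joins consecutive events
// with a vertical connector and a down arrow, as in the example above.
func renderTimeline(events []Event) string {
	var b strings.Builder
	for i, e := range events {
		if i > 0 {
			b.WriteString("│\n▼\n")
		}
		fmt.Fprintf(&b, "[%s] %s\n", e.FiredAt.Format("15:04:05.000"), e.Summary)
	}
	return b.String()
}

// exampleRender renders the first two hops of the example cascade.
func exampleRender() string {
	base := time.Date(2024, 1, 1, 14, 32, 0, 0, time.UTC)
	return renderTimeline([]Event{
		{FiredAt: base.Add(1000 * time.Millisecond), Summary: "Disk fsync p99 spiked to 240ms (was 8ms)"},
		{FiredAt: base.Add(3500 * time.Millisecond), Summary: "PostgreSQL write latency p99 climbed to 1.8s"},
	})
}

func main() {
	fmt.Print(exampleRender())
}
```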
Approach
- Signal history: bump engine.maxHistory from 10 to ~120 (10 minutes at 5s intervals; bounded, ~1MB max).
- Timeline construction: for each finding, find the earliest Signals snapshot in history where the underlying metric crossed its threshold. That's the "fired at" timestamp.
- Causation links: a small rule table — disk-saturation → DB-slow → upstream-API → cascade is one row; OOMKill ← memory-pressure-imminent is another. Match by signal pair + temporal ordering (cause must precede effect).
- Confidence: each link has 0–1 confidence; only links above 0.7 render in the timeline.
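The construction, rule-matching, and confidence steps above can be sketched roughly as follows. This is illustrative only: Snapshot, Rule, firedAt, linkEvents, and the metric names are hypothetical, the rule table is flattened to one cause/effect pair per row, and "crossed its threshold" is simplified to "first snapshot whose value exceeds the threshold".

```go
package main

import (
	"fmt"
	"time"
)

// Snapshot stands in for one entry of the signal history ring buffer.
type Snapshot struct {
	At      time.Time
	Metrics map[string]float64 // metric name -> value (assumed shape)
}

// firedAt returns the time of the earliest snapshot in which metric
// exceeds threshold. history is assumed to be ordered oldest first.
func firedAt(history []Snapshot, metric string, threshold float64) (time.Time, bool) {
	for _, s := range history {
		if v, ok := s.Metrics[metric]; ok && v > threshold {
			return s.At, true
		}
	}
	return time.Time{}, false
}

// Rule is one cause/effect row of the causation table.
type Rule struct {
	Cause, Effect string
	Confidence    float64
}

// Link is a rule that matched the observed ordering.
type Link struct {
	Cause, Effect string
	Confidence    float64
	GapMs         int64
}

// Only links strictly above this confidence are kept for rendering.
const minConfidence = 0.7

// linkEvents keeps a rule when both signals fired, the cause fired
// strictly before the effect, and the confidence clears minConfidence.
func linkEvents(fired map[string]time.Time, rules []Rule) []Link {
	var links []Link
	for _, r := range rules {
		c, okC := fired[r.Cause]
		e, okE := fired[r.Effect]
		if !okC || !okE {
			continue
		}
		gap := e.Sub(c)
		if gap <= 0 || r.Confidence <= minConfidence {
			continue
		}
		links = append(links, Link{Cause: r.Cause, Effect: r.Effect, Confidence: r.Confidence, GapMs: gap.Milliseconds()})
	}
	return links
}

// exampleLinks runs the sketch over a synthetic disk -> DB -> API cascade.
func exampleLinks() []Link {
	t0 := time.Date(2024, 1, 1, 14, 32, 0, 0, time.UTC)
	history := []Snapshot{
		{At: t0, Metrics: map[string]float64{"disk_fsync_p99_ms": 8, "db_write_p99_ms": 40, "api_p99_ms": 120}},
		{At: t0.Add(5 * time.Second), Metrics: map[string]float64{"disk_fsync_p99_ms": 240, "db_write_p99_ms": 45, "api_p99_ms": 130}},
		{At: t0.Add(10 * time.Second), Metrics: map[string]float64{"disk_fsync_p99_ms": 250, "db_write_p99_ms": 1800, "api_p99_ms": 150}},
		{At: t0.Add(15 * time.Second), Metrics: map[string]float64{"disk_fsync_p99_ms": 245, "db_write_p99_ms": 1900, "api_p99_ms": 6000}},
	}
	thresholds := map[string]float64{"disk_fsync_p99_ms": 100, "db_write_p99_ms": 500, "api_p99_ms": 1000}
	fired := map[string]time.Time{}
	for metric, th := range thresholds {
		if at, ok := firedAt(history, metric, th); ok {
			fired[metric] = at
		}
	}
	rules := []Rule{
		{Cause: "disk_fsync_p99_ms", Effect: "db_write_p99_ms", Confidence: 0.9},
		{Cause: "db_write_p99_ms", Effect: "api_p99_ms", Confidence: 0.8},
		{Cause: "api_p99_ms", Effect: "disk_fsync_p99_ms", Confidence: 0.9}, // effect fired first: dropped
		{Cause: "disk_fsync_p99_ms", Effect: "api_p99_ms", Confidence: 0.5}, // below 0.7: dropped
	}
	return linkEvents(fired, rules)
}

func main() {
	for _, l := range exampleLinks() {
		fmt.Printf("%s -> %s (confidence %.1f, gap %dms)\n", l.Cause, l.Effect, l.Confidence, l.GapMs)
	}
}
```

With 5s snapshots the gap can only resolve to multiples of the sampling interval, so two signals that cross in the same snapshot get no link under the strict "cause must precede effect" rule.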
Acceptance criteria
- kerno doctor --timeline renders the box-tree view
- kerno doctor --output json includes a timeline array with each link annotated by {cause, effect, confidence, gap_ms}
- chaos cascade in scripts/verify.sh: induce the disk→TCP→memory→CPU cascade and assert the timeline links them in order
- When --ai is set, the AI summary references the timeline by ID, not by raw finding name
Why this matters
This is the shareable feature. A screenshot of a 4-hop causal timeline going viral on r/sre is worth 10x the OOMKill detection. It's also the foundation for Phase 18.1 (shareable incident links) and the kubectl-kerno demo.
Effort
~5 days. Significant: history store, link engine, renderer, AI prompt update, tests.
References
- Phase 14.2 in TODO.md (the "headline feature")
- Brand positioning: BRAND_TODO.md (gitignored, in repo): "Datadog shows, Kerno tells"