Ingero - GPU Causal Observability


Featured in: awesome-ebpf · awesome-observability · awesome-opentelemetry · awesome-sre-tools · awesome-cloud-native · awesome-profiling · Awesome-GPU · awesome-gpu-engineering · awesome-mcp-servers · awesome-devops-mcp-servers · MCP Registry · Glama · mcpservers.org

Version: 0.14.1

The only GPU observability tool your AI assistant can talk to.

"What caused the GPU stall?" → "forward() at train.py:142 - cudaMalloc spiking 48ms during CPU contention. 9,829 calls, 847 scheduler preemptions."

Ingero is a production-grade eBPF agent that traces the full chain - from Linux kernel events through CUDA API calls to your Python source lines - with <2% overhead, zero code changes, and one binary.

ingero demo incident — CPU contention causes GPU latency spike, full causal chain diagnosis with root cause and fix recommendation

Try it in 60 seconds

# Install (Linux amd64; arm64 / Docker / source variants are below)
# ingero-version:install-curl product=ingero channel=stable
VERSION=0.14.1
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_amd64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/

# Trace your GPU workload
sudo ingero trace

# Explain what happened (no sudo)
ingero explain --since 5m

No GPU? ingero demo --no-gpu incident runs the full causal-chain diagnosis on synthetic data; no root, no GPU, no ceremony.


What you get

  • Causal chains, not just metrics. Ingero correlates CPU scheduler events, CUDA Runtime API calls, CUDA Driver API calls, CUDA Graph lifecycle, and NCCL collectives by timestamp + PID, then writes a timeline that explains why throughput dropped, not just that it did. ingero explain ships the chain with a fix recommendation.
  • Driver-level visibility most profilers miss. cuBLAS / cuDNN / torch.compile call cuLaunchKernel directly via the Driver API and bypass the Runtime. Ingero traces both layers, so the kernels other tools never see show up here.
  • Python source attribution from cudaMalloc. For PyTorch / TensorFlow workloads, Ingero extracts CPython frames from process memory and injects [Python] file.py:line in func() into the stack alongside native frames. CPython 3.10 / 3.11 / 3.12. The userspace walker is the default; an in-kernel eBPF walker handles kernel.yama.ptrace_scope=3 hardened systems.
  • AI-queryable in one command. ingero mcp exposes a Model Context Protocol server (stdio or HTTPS / TLS 1.3). Your AI assistant asks "what caused the GPU stall?" and gets back a resolved causal chain with Python source lines, no SQL, no hex addresses. Works with Claude Code, Cursor, and any local model via Ollama.
  • <2% overhead, zero code changes, single binary. GPU-measured: 0.4-1.7% across RTX 3090 through H100 with stack tracing on. No SDK to import, no agent to run alongside your training process, no Helm chart to start with: one curl install, one sudo ingero trace, one ingero explain.
$ sudo ingero trace

  Ingero Trace  -  Live CUDA Event Stream
  Target: PID 4821 (python3)
  Library: /usr/lib/x86_64-linux-gnu/libcudart.so.12
  CUDA probes: 14 attached
  Driver probes: 10 attached
  Host probes: 7 attached

  System: CPU [████████░░░░░░░░░░░░] 47% | Mem [██████████████░░░░░░] 72% (11.2 GB free) | Load 3.2 | Swap 0 MB

  CUDA Runtime API                                               Events: 11,028
  ┌──────────────────────┬────────┬──────────┬──────────┬──────────┬─────────┐
  │ Operation            │ Count  │ p50      │ p95      │ p99      │ Flags   │
  ├──────────────────────┼────────┼──────────┼──────────┼──────────┼─────────┤
  │ cudaLaunchKernel     │ 11,009 │ 5.2 µs   │ 12.1 µs  │ 18.4 µs  │         │
  │ cudaMalloc           │     12 │ 125 µs   │ 2.1 ms   │ 8.4 ms   │ ⚠ p99  │
  │ cudaDeviceSynchronize│      7 │ 684 µs   │ 1.2 ms   │ 3.8 ms   │         │
  └──────────────────────┴────────┴──────────┴──────────┴──────────┴─────────┘

  CUDA Driver API                                                Events: 17,525
  ┌──────────────────────┬────────┬──────────┬──────────┬──────────┬─────────┐
  │ Operation            │ Count  │ p50      │ p95      │ p99      │ Flags   │
  ├──────────────────────┼────────┼──────────┼──────────┼──────────┼─────────┤
  │ cuLaunchKernel       │ 17,509 │ 4.8 µs   │ 11.3 µs  │ 16.2 µs  │         │
  │ cuMemAlloc           │     16 │ 98 µs    │ 1.8 ms   │ 7.1 ms   │         │
  └──────────────────────┴────────┴──────────┴──────────┴──────────┴─────────┘

  Host Context                                                   Events: 258
  ┌─────────────────┬────────┬──────────────────────────────────────────┐
  │ Event           │ Count  │ Detail                                   │
  ├─────────────────┼────────┼──────────────────────────────────────────┤
  │ mm_page_alloc   │    251 │ 1.0 MB allocated (order-0: 251)          │
  │ process_exit    │      7 │ 7 processes exited                       │
  └─────────────────┴────────┴──────────────────────────────────────────┘

  ⚠ cudaStreamSync p99 = 142ms  -  correlated with 23 sched_switch events
    (GPU thread preempted during sync wait, avg 2.1ms off-CPU)

See It In Action

sudo ingero check — system readiness
ingero check verifying kernel, BTF, NVIDIA driver, GPU model, CUDA libraries, and active processes
sudo ingero trace — live event stream
ingero trace showing live CUDA Runtime and Driver API statistics with rolling p50/p95/p99 latencies and host context
ingero explain --since 5m — automated diagnosis
ingero explain producing incident report with causal chains, root cause analysis, and fix recommendations
sudo ingero trace — CUDA Graph lifecycle events
ingero trace showing CUDA Graph capture, instantiate, and launch events alongside CUDA runtime and host events
ingero explain — graph causal chain diagnosis
ingero explain showing causal chain linking CUDA Graph launch to CPU contention with fix recommendations
ingero demo --no-gpu incident — try without a GPU
ingero demo running in synthetic mode without GPU, showing full causal chain diagnosis
ingero demo                 # run all 6 scenarios (auto-detects GPU)
ingero demo incident        # full causal chain in 30 seconds
ingero demo --no-gpu        # synthetic mode (no root, no GPU needed)
sudo ingero demo --gpu      # real GPU + eBPF tracing

Scenarios

| Scenario | What It Reveals |
| --- | --- |
| incident | CPU spike + sched_switch storm → cudaStreamSync 8.5x latency spike → full causal chain with root cause and fix |
| cold-start | First CUDA calls take 50-200x longer than steady state (CUDA context init) |
| memcpy-bottleneck | cudaMemcpy dominates wall-clock time (38%), not compute - nvidia-smi lies |
| periodic-spike | cudaMalloc spikes 50x every ~200 batches (PyTorch caching allocator) |
| cpu-contention | Host CPU preemption causes CUDA latency spikes |
| gpu-steal | Multi-process GPU time-slicing quantified via CUDA API timing patterns |

Every scenario prints a GPU auto-detect header showing GPU model and driver version, then displays real-time ASCII bar charts for system context.


This README covers single-node GPU tracing and investigation. For multi-node distributed training diagnostics (fan-out queries across nodes, offline database merge, Perfetto timeline export, clock skew detection), see the Multi-Node Investigation Walkthrough.


Install

Binary Release (recommended)

Download a pre-built binary from GitHub Releases.

Archive filenames include the version: ingero_<version>_linux_<arch>.tar.gz. Replace VERSION below with the latest release (e.g., 0.14.1):

# Linux amd64
# ingero-version:install-archive-amd64 product=ingero channel=stable
VERSION=0.14.1
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_amd64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/

# Linux arm64 (GH200, Grace Hopper, Graviton)
# ingero-version:install-archive-arm64 product=ingero channel=stable
VERSION=0.14.1
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_arm64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/

Docker Image

Multi-arch images (amd64 + arm64) are published to GHCR on every release:

# Pull the latest image
docker pull ghcr.io/ingero-io/ingero:latest

# Or pin to a specific version
# ingero-version:docker-pull-example product=ingero channel=stable
docker pull ghcr.io/ingero-io/ingero:v0.14.1

# Quick test (no root, no GPU needed)
docker run --rm ghcr.io/ingero-io/ingero demo --no-gpu

# System readiness check
docker run --rm --privileged --pid=host ghcr.io/ingero-io/ingero check

# Live eBPF tracing (requires privileges + kernel mounts)
docker run --rm --privileged --pid=host \
  -v /sys/kernel/debug:/sys/kernel/debug \
  -v /sys/kernel/btf:/sys/kernel/btf:ro \
  -v /var/lib/ingero:/var/lib/ingero \
  ghcr.io/ingero-io/ingero trace --record

Minimum capabilities (alternative to --privileged): --cap-add=BPF --cap-add=PERFMON --cap-add=SYS_ADMIN.
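
If you prefer capabilities over --privileged, the trace invocation above would look roughly like this - a sketch assembled from the mounts and capabilities listed in this README; some kernels may still require extra privileges:

# Capability-scoped variant of the privileged trace command (verify on your kernel)
docker run --rm --pid=host \
  --cap-add=BPF --cap-add=PERFMON --cap-add=SYS_ADMIN \
  -v /sys/kernel/debug:/sys/kernel/debug \
  -v /sys/kernel/btf:/sys/kernel/btf:ro \
  -v /var/lib/ingero:/var/lib/ingero \
  ghcr.io/ingero-io/ingero trace --record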

Note: eBPF tracing (trace, demo --gpu) requires --privileged (or the capability set above) plus --pid=host and the kernel volume mounts shown above. Without these, only unprivileged commands work (demo --no-gpu, check, version, explain, query). The --pid=host flag shares the host's /proc - do not also bind-mount -v /proc:/proc:ro as this causes OCI runtime errors on Docker Desktop and WSL2.

Data persistence: The container stores the SQLite database at /var/lib/ingero/ingero.db by default. Mount -v /var/lib/ingero:/var/lib/ingero to persist data after the container stops. Without this mount, all trace data is lost when the container exits.

Multiple databases: Use --db or the INGERO_DB env var to work with different databases:

# Trace to a named database
docker run --rm --privileged --pid=host \
  -v /var/lib/ingero:/var/lib/ingero \
  -v /sys/kernel/debug:/sys/kernel/debug \
  -v /sys/kernel/btf:/sys/kernel/btf:ro \
  ghcr.io/ingero-io/ingero trace --db /var/lib/ingero/training-run-42.db

# Investigate a specific database
docker run --rm \
  -v /var/lib/ingero:/var/lib/ingero \
  ghcr.io/ingero-io/ingero explain --db /var/lib/ingero/training-run-42.db

# Compare databases from different runs
docker run --rm \
  -v /var/lib/ingero:/var/lib/ingero \
  ghcr.io/ingero-io/ingero query --db /var/lib/ingero/training-run-41.db --since 1h

docker run --rm \
  -v /var/lib/ingero:/var/lib/ingero \
  ghcr.io/ingero-io/ingero query --db /var/lib/ingero/training-run-42.db --since 1h

The image is ~10 MB (Alpine 3.20 + statically linked Go binary). When building the dev Dockerfile locally, pass version info via build args:

# ingero-version:docker-build-arg product=ingero channel=stable
docker build -f deploy/docker/Dockerfile \
  --build-arg VERSION=0.14.1 \
  --build-arg COMMIT=$(git rev-parse --short HEAD) \
  --build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  -t ingero:local .

GHCR images have version info baked in automatically via GoReleaser. See deploy/docker/Dockerfile for details.

Build from Source

# Quick setup: install all build dependencies (Go, clang, llvm) on Ubuntu 22.04/24.04
curl -fsSL https://raw.githubusercontent.com/ingero-io/ingero/main/scripts/install-deps.sh | bash

# Requires clang-14, Linux kernel with BTF
git clone https://github.com/ingero-io/ingero.git
cd ingero
make              # generates eBPF bindings, builds, tests, and lints  -  single command
sudo make install # optional  -  copies binary to /usr/local/bin/ingero
                  # or just use ./bin/ingero directly, or: alias ingero=$PWD/bin/ingero

Requirements

  • Linux kernel 5.15+ with BTF (CONFIG_DEBUG_INFO_BTF=y)
  • NVIDIA driver 550+ with CUDA 11.x, 12.x, or 13.x
  • Root / CAP_BPF + CAP_PERFMON (eBPF requires elevated privileges)
  • Tested on: GH200, H100, A100, A10, RTX 4090, RTX 3090 (x86_64 and aarch64)
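
sudo ingero check verifies all of the above; a quick manual pass over the same requirements looks roughly like this (paths and tools vary by distro, so treat it as a sketch):

# Kernel version and BTF (CONFIG_DEBUG_INFO_BTF=y exposes this pseudo-file)
uname -r
ls /sys/kernel/btf/vmlinux

# NVIDIA driver version and CUDA runtime library on the loader path
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
ldconfig -p | grep libcudart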

Commands

| Command | Purpose |
| --- | --- |
| ingero check | system readiness (kernel, BTF, driver, GPU, CUDA libs, processes) |
| ingero trace | live event stream + record to SQLite (requires sudo for eBPF) |
| ingero explain | causal chains + root cause + fix recommendation from recorded events |
| ingero query | SQL over the local DB; --nodes fans out across multi-node fleets |
| ingero mcp | Model Context Protocol server for AI agents (stdio or HTTPS/TLS 1.3) |
| ingero dashboard | browser dashboard backed by SQLite (HTTPS/TLS 1.3) |
| ingero merge | merge SQLite DBs from multiple nodes into one queryable database |
| ingero export | export to Perfetto / Chrome Trace Event Format for timeline UIs |
| ingero demo | 6 synthetic scenarios; works without a GPU via --no-gpu |
| ingero version | print version |

For full flag reference, examples, and per-command output samples, see docs/commands.md. For the AI / MCP integration deep-dive, see docs/commands.md#ingero-mcp.
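
Put together, a typical single-node session uses only flags already shown in this README (the duration is illustrative):

# Verify the host, trace the workload, then investigate without sudo
sudo ingero check
sudo ingero trace --record
ingero explain --since 30m
ingero export --format perfetto   # open the resulting timeline in ui.perfetto.dev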

Stack Tracing

Stack tracing is on by default: every CUDA / Driver API event captures the full userspace call chain, showing who called cudaMalloc - from the CUDA library up through PyTorch and your Python code, all the way to main(). GPU-measured overhead is 0.4-0.6% (within noise on RTX 3090 through H100). Disable with --stack=false if needed.

For Python workloads, Ingero extracts CPython frame information directly from process memory (versions 3.10 / 3.11 / 3.12 supported) and injects [Python] source frames into the stack alongside [Native] frames.

See docs/stack_tracing.md for the userspace vs eBPF walker selection, JSON output examples, and kernel.yama.ptrace_scope troubleshooting.
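
For example, combining the flags documented here and in the Known issues section below (the PID is the one from the trace output above and purely illustrative):

# Stack tracing is on by default; turn it off explicitly if needed
sudo ingero trace --stack=false

# On hardened hosts (kernel.yama.ptrace_scope=3) or Ubuntu 24.04 CPython 3.12,
# prefer the in-kernel walker with an explicit PID target
sudo ingero trace --py-walker=ebpf --pid 4821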

Integrations

| Integration | What it does | Reference |
| --- | --- | --- |
| MCP server (AI agents) | ingero mcp exposes 10 tools + a /investigate prompt to any MCP-compatible client (Claude Code, Cursor, Ollama). Stdio or HTTPS / TLS 1.3. | docs/commands.md#ingero-mcp |
| OTLP / Prometheus | --otlp HOST:PORT (HTTP JSON) or --prometheus :PORT (pull). Standard semantic conventions; compatible with OTel Collector, Grafana Alloy / Cloud, Datadog, New Relic. | docs/otlp.md |
| Browser dashboard | ingero dashboard serves an HTTPS dashboard backed by the SQLite trace DB. Live ops / chains / snapshots. | docs/commands.md#ingero-dashboard |
| Multi-node cluster | Real-time cluster-wide straggler detection via the Ingero Fleet collector. Agents push OTLP via ingero fleet-push. K8s / bare-metal / Docker. | docs/quickstart_fleet.md |
| Perfetto / Chrome tracing | ingero export --format perfetto produces a timeline you open in ui.perfetto.dev or chrome://tracing. One track per node / rank. | docs/commands.md#ingero-export |
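
The table maps to a handful of one-liners; a rough sketch (host and port values are placeholders, not Ingero defaults):

# MCP server over stdio for a local AI assistant (Claude Code, Cursor, Ollama)
ingero mcp

# Export metrics to an OTel Collector, or expose a Prometheus pull endpoint
sudo ingero trace --otlp localhost:4318
sudo ingero trace --prometheus :9464

# Browser dashboard and Perfetto timeline export
ingero dashboard
ingero export --format perfetto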

Architecture

The agent attaches its own eBPF probes - uprobes on libcudart.so / libcuda.so and host tracepoints - then drains ring buffers from user space, assembles causal chains (SYSTEM + HOST + CUDA Runtime + CUDA Driver + CUDA Graph + NCCL), and writes results to local SQLite. The MCP server, Prometheus / OTLP exporter, and HTTPS dashboard all read from that SQLite database. Pipeline: discover → attach → capture → correlate → store → serve.

For the full ASCII diagram, per-stage detail, and the cluster-mode (multi-node Fleet) architecture, see docs/architecture.md. For multi-node deployment, see docs/quickstart_fleet.md.

Validated on 6 GPU models (GH200, A100 SXM4, A10, H100 PCIe/SXM5, RTX 4090, RTX 3090) across TensorDock, Lambda Labs, and Azure with x86_64 and aarch64. Stack-on overhead 0.4-1.7% measured.

What Ingero detects (CRITICAL & HIGH)

| GPU Problem | Severity | How Ingero detects it |
| --- | --- | --- |
| NCCL hangs & distributed-training deadlocks | CRITICAL | Direct ncclAllReduce / ncclSend / ncclRecv uprobes (v0.12.0+) measure per-collective wall time with rank / comm_id_hash / nranks correlation; sched_switch + TCP retransmit as cross-checks. |
| GPU underutilization / data pipeline starvation | CRITICAL | Host scheduler + cudaStreamSync + cudaMemcpy pipeline-bubble diagnosis; block-I/O shows DataLoader disk bottleneck. |
| CUDA OOM & memory fragmentation | CRITICAL | cudaMalloc / cuMemAlloc allocation patterns; managed-memory over-subscription via cudaMallocManaged. |
| KV-cache pressure & preemption cascades | CRITICAL | cudaMalloc patterns + cudaStreamSync spikes during preemption; managed-memory page-fault detection. |
| CPU bottleneck in GPU serving | HIGH | sched_switch on inference process + cudaStreamSync idle gaps. |
| Multi-process GPU contention (RAG, multi-tenant) | HIGH | Per-process CUDA API breakdown (explain --per-process); per-cgroup sched_switch. |
| GPU memory leaks in long-running services | HIGH | cudaMalloc / cudaFree imbalance tracking, per-container via cgroup. |
| GPU scheduling & orchestration failures | HIGH | Per-cgroup sched_switch + multi-orchestrator metadata: K8s, Slurm, ECS, Docker / containerd (v0.12.3+). |

Full list of detected issues: docs/detections.md
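
For the multi-process contention row above, the per-process breakdown is a single flag (the time window is illustrative):

# Per-process CUDA API breakdown for multi-tenant / RAG contention
ingero explain --per-process --since 1h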

More Reading

This section is the combined entry point for general questions, known issues, recurring patterns Ingero auto-detects, the troubleshooting cheatsheet, and advanced tuning. Each subsection is collapsible on its own; click to expand.

General FAQ (production safety, container support, GPU support, storage)

Is it safe for production? Yes. eBPF programs are verified by the kernel before loading - they cannot crash the system. Probes add <2% overhead including stack tracing (0.4-0.6% measured across RTX 3090, RTX 4090, A10, A100, H100 with PyTorch workloads).

Does it require code changes? No. Ingero attaches to libcudart.so and kernel tracepoints at the OS level. Your application code is untouched. Traces any language - Python, C++, Java - anything linked against libcudart.so.

What GPUs are supported? Any NVIDIA GPU with driver 550+ and CUDA 11.x, 12.x, or 13.x. Tested on GH200 (aarch64), H100, A100, A10, RTX 4090, RTX 3090 (x86_64). Works on AWS Deep Learning AMIs (auto-discovers versioned libcudart.so).

Does it work in containers? Yes. eBPF programs execute in kernel space - the container just loads them via syscalls. Run with --privileged (or --cap-add=BPF,PERFMON,SYS_ADMIN), --pid=host, and mount /sys/kernel/debug and /sys/kernel/btf (do not bind-mount /proc - --pid=host already shares the host's /proc). The host kernel must have BTF enabled. Pre-built images are available at ghcr.io/ingero-io/ingero - see the Docker Image install section. This is the same pattern used by Falco, Tetragon, and other eBPF DaemonSets.

Where is data stored? Locally in ~/.ingero/ingero.db (SQLite). Nothing leaves your machine. Size-based pruning keeps the DB under 10 GB by default. With --record-all, this covers a few hours of heavy GPU load; with selective storage (default), it lasts much longer. Configure with --max-db (e.g., --max-db 500m, --max-db 0 for unlimited). Use --db /path/to/file.db for a custom location.

Does it check for updates? Yes. On interactive commands (trace, demo, explain, check), ingero checks GitHub Releases for newer versions (once per 24 hours, cached in ~/.ingero/update-check). The check runs in the background and never delays your command. Set INGERO_NO_UPDATE_NOTIFIER=1 to disable. Skipped for query, mcp, version, and dev builds.
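
Putting the storage and update-notifier knobs from this FAQ together (the path and size are illustrative):

# Cap the trace DB at 500 MB and record to a custom location
sudo ingero trace --max-db 500m --db /data/ingero/run-42.db

# Disable the once-per-24h release check
export INGERO_NO_UPDATE_NOTIFIER=1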

Known issues + Walker roadmap
  • Multiprocess CUDA via fork(). NVIDIA's CUDA driver doesn't support fork() after the parent has initialized a CUDA context — children can't use CUDA. The eBPF walker propagates the parent's walker state to the child synchronously on fork, but a child that can't call CUDA won't trigger the walker regardless. For multiprocess CUDA workloads, use torch.multiprocessing.set_start_method('spawn') (or Ray/torchrun spawn equivalents); fresh spawn-style processes initialize their own CUDA context and get full walker coverage through the normal dynamic-PID path.

  • Ubuntu 24.04 + distro-patched CPython 3.12 on the userspace walker. Ubuntu's patched CPython has struct offsets that differ from upstream, and Ubuntu 24.04 doesn't ship python3.12-dbgsym in the main archive. The userspace walker falls back to hardcoded upstream offsets and produces garbage frame data. The eBPF walker sidesteps this via its runtime offset harvester — use --py-walker=ebpf on Ubuntu 24.04 until dbgsym becomes readily available. (Installing python3.12-dbgsym from ddebs.ubuntu.com also resolves the userspace-walker path.)

  • Trace-all mode at kernel.yama.ptrace_scope=3. The eBPF walker works at ptrace_scope=3 with an explicit --pid X target (PID-specific uprobe attach). Without --pid (trace-all / dynamic-discovery mode), cuda/driver uprobes may not fire — a startup warning is logged. Workarounds: pass --pid X or lower kernel.yama.ptrace_scope to 1 (the Ubuntu default).
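
The two workarounds for that last item as shell commands (the PID is illustrative):

# Option 1: target a specific PID so PID-specific uprobes attach even at ptrace_scope=3
sudo ingero trace --pid 4821

# Option 2: relax Yama back to the Ubuntu default (host-wide setting)
sudo sysctl kernel.yama.ptrace_scope=1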

Walker roadmap — userspace walker deprecation

The in-kernel eBPF walker (--py-walker=ebpf) is now the strategic path forward. It has runtime offset harvesting for patched distro builds (including Ubuntu 24.04 CPython 3.12), multi-library libcudart coverage, per-CUDA-tracer state broadcast, fork-inheritance for multiprocess workloads, and runs at any ptrace_scope value with --pid.

The userspace walker — the current default — will be deprecated in an upcoming release. Once the remaining items above are either fixed or accepted as narrow trade-offs, the eBPF walker will be promoted to the auto default. If you're setting up new deployments, prefer --py-walker=ebpf today. The userspace mode will remain available (via --py-walker=userspace) after the default flips, until the deprecation window closes.

Known patterns + troubleshooting + advanced configuration

Detected patterns (CUDA Graph capture, Python frames, dropped events), operational cheat sheet, and power-user reference material all live in docs/troubleshooting.md.

Reference

| Topic | Link |
| --- | --- |
| Full command reference (all 10 commands, every flag, output samples) | docs/commands.md |
| Architecture (single-node + cluster modes, ASCII diagram, pipeline stages) | docs/architecture.md |
| Multi-node cluster deployment (K8s / bare-metal / Docker) | docs/quickstart_fleet.md |
| Stack tracing deep-dive (Python walker selection, JSON output, ptrace troubleshooting) | docs/stack_tracing.md |
| OTLP / Prometheus integration (transport, metric names, compatibility) | docs/otlp.md |
| Full 25-problem detection catalog | docs/detections.md |
| AI investigation walkthrough (real session on A100 + GH200) | docs/ml_eng_sample_investigation_session.md |
| Fleet ingero fleet-push subcommand | docs/push_fleet.md |
| Multi-node remediation protocol (PoC, experimental) | docs/remediation-protocol_fleet.md |
| Test matrix | docs/test_matrix.md |

License

Ingero is 100% free and open source. Use it for anything - personal, commercial, enterprise, embed it in your product, modify it, redistribute it. No usage restrictions, no phone-home.

Dual-licensed following the standard eBPF split-licensing model (same as Cilium, Falco, and most eBPF projects):

  • User-Space (Go agent, CLI, causal engine, SQLite, MCP): Apache License 2.0 - maximum enterprise compatibility, no copyleft.
  • Kernel-Space (eBPF C code in bpf/): GPL-2.0 OR BSD-3-Clause - GPL-2.0 is required by the Linux kernel's BPF subsystem; BSD-3-Clause permits embedding in non-GPL toolchains.