Skip to content

feat: eBPF process lifecycle tracking + ProcessMetadataCache library API#89

Merged
zz85 merged 6 commits into
mainfrom
process_lifecycle
Apr 1, 2026
Merged

feat: eBPF process lifecycle tracking + ProcessMetadataCache library API#89
zz85 merged 6 commits into
mainfrom
process_lifecycle

Conversation

@zz85

@zz85 zz85 commented Mar 31, 2026

Copy link
Copy Markdown
Owner

Add eBPF-driven process exec/exit detection and a ProcessMetadataCache to profile-bee's library API, enabling external profiling agents to get process-aware profiling without polling /proc.

eBPF changes:

  • sched_process_exec tracepoint program (detects execve() calls)
  • lifecycle_tracking_map to broaden exit events for all processes
  • process_exec_events ring buffer for exec notifications

Userspace:

  • ProcessEvent unified type replaces ProcessExitEvent in PerfWork
  • DwarfThreadMsg::ProcessExeced triggers DWARF table reload on exec
  • ProcessMetadataCache: lazy /proc reader with eBPF-driven eviction
  • PointerStackFramesCache::invalidate_pid() for exec cache flush
  • SessionConfig::track_process_lifecycle (auto-on with DWARF)
  • Graceful fallback for older eBPF binaries without new programs

Benefits for profile-bee core:

  • Proactive DWARF table loading on exec (no 1s poll delay)
  • Symbol cache invalidation on exec (no stale resolutions)
  • Immediate DWARF shard reclamation on exit

97 tests pass (9 new for ProcessMetadataCache), clippy -D warnings clean.

Summary by CodeRabbit

  • New Features

    • Process lifecycle tracking: captures exec and exit events to keep process state, trigger symbol/DWARF reloads, and update monitoring reliably.
    • Process metadata cache: on-demand /proc snapshots (cmdline, cwd, env, exe, mount ns) with capacity and automatic eviction; runtime access for enrichment and debugging.
    • Unified process event model and optional system-wide lifecycle mode.
  • Documentation

    • Agent guide updated with lifecycle-tracking, process-metadata cache, and enrichment examples.
  • Tests

    • Unit tests for process-metadata cache (load, invalidate, capacity, env lookup).

@coderabbitai

coderabbitai Bot commented Mar 31, 2026

Copy link
Copy Markdown

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3096aef5-0bf4-4364-8f02-6eb633e619da

📥 Commits

Reviewing files that changed from the base of the PR and between 7d3ca02 and 276964c.

📒 Files selected for processing (1)
  • profile-bee/src/process_metadata.rs

📝 Walkthrough

Walkthrough

eBPF now emits both exec and exit lifecycle events; userspace unifies them into a single ProcessEvent and dispatches EXEC vs EXIT through pipeline, event loop, and DWARF thread. A lazy, capacity-bounded per-PID ProcessMetadataCache and per-PID symbol/frame invalidation were added; session config controls lifecycle tracking and metadata usage.

Changes

Cohort / File(s) Summary
Docs & Session Config
docs/custom_profiling_agent.md, profile-bee/src/session.rs
Documented ProcessMetadataCache usage; added SessionConfig.track_process_lifecycle: bool (default false); lifecycle tracking enables exec tracepoint, sets lifecycle flag in eBPF, and passes enable_process_metadata to the event loop.
Common Event Types
profile-bee-common/src/lib.rs
Added ProcessExecEvent and unified ProcessEvent (with event_type/pid/exit_code); introduced PROCESS_EVENT_EXIT/PROCESS_EVENT_EXEC; deprecated ProcessExitEvent in docs.
eBPF Programs
profile-bee-ebpf/src/lib.rs, profile-bee-ebpf/src/main.rs
Added LIFECYCLE_TRACKING_MAP and lifecycle_tracking_enabled(); new process_exec_events ring buffer and RING_BUF_PROCESS_EXEC; added handle_process_exec and tracepoint_process_exec; exit emission broadened when lifecycle tracking enabled.
Userspace eBPF API
profile-bee/src/ebpf.rs
Added setup_process_exec_ring_buffer, attach_process_exec_tracepoint, and EbpfProfiler::set_lifecycle_tracking to detect optional maps/programs and toggle lifecycle-tracking map value.
Pipeline & Ring-buffer Tasks
profile-bee/src/pipeline.rs
Replaced PerfWork::ProcessExit with PerfWork::ProcessEvent; exit reader now wraps into ProcessEvent; added setup_process_exec_ring_buffer_task; added DwarfThreadMsg::ProcessExeced and exec-driven DWARF remove+reload flow.
Event Loop & Metadata Cache
profile-bee/src/event_loop.rs
Added optional ProcessMetadataCache, pending_exit_pids, and enable_process_metadata flag; drain_events handles ProcessEvent EXIT/EXEC, evicts/invalidate metadata and symbol caches, and exposes evict_pending_exits() and process_metadata() accessors.
Process Metadata Module
profile-bee/src/process_metadata.rs
New module: ProcessMetadata and ProcessMetadataCache with lazy /proc reads, PID-reuse detection via start_time, capacity bounds, environ lookup, get_or_load/get/invalidate/remove, stats, and unit tests.
Per-PID Cache Invalidation
profile-bee/src/cache.rs, profile-bee/src/trace_handler.rs
Added PointerStackFramesCache::invalidate_pid and TraceHandler::invalidate_caches_for_pid to clear per-PID frame/address cache entries on exec.
Binary / Event Handling
profile-bee/bin/profile-bee.rs
Event loop constructed with enable_process_metadata per lifecycle flag; unified handling of PerfWork::ProcessEvent for EXEC/EXIT, forwarding ProcessExeced/ProcessExited messages and updating known_tgids.
Exports
profile-bee/src/lib.rs
Exported process_metadata module for external use.

Sequence Diagram

sequenceDiagram
    actor Kernel
    participant eBPF as "eBPF Programs"
    participant RingBuf as "Ring Buffers"
    participant Pipeline as "Event Pipeline"
    participant EventLoop as "Event Loop"
    participant DwarfThread as "DWARF Thread"
    participant Cache as "ProcessMetadataCache"

    Kernel->>eBPF: sched_process_exec tracepoint
    eBPF->>RingBuf: emit ProcessExecEvent
    RingBuf->>Pipeline: read ProcessExecEvent
    Pipeline->>Pipeline: convert to ProcessEvent (EXEC)
    Pipeline->>EventLoop: send ProcessEvent (EXEC)

    Kernel->>eBPF: sched_process_exit tracepoint
    eBPF->>RingBuf: emit ProcessExitEvent
    RingBuf->>Pipeline: read ProcessExitEvent
    Pipeline->>Pipeline: convert to ProcessEvent (EXIT)
    Pipeline->>EventLoop: send ProcessEvent (EXIT)

    EventLoop->>EventLoop: match event_type

    rect rgba(100,200,100,0.5)
    EventLoop->>DwarfThread: ProcessExeced(pid)
    DwarfThread->>DwarfThread: remove old DWARF state
    DwarfThread->>DwarfThread: reload DWARF for pid
    EventLoop->>EventLoop: invalidate symbol caches
    EventLoop->>Cache: invalidate(pid)
    end

    rect rgba(200,100,100,0.5)
    EventLoop->>DwarfThread: ProcessExited(pid)
    DwarfThread->>DwarfThread: cleanup DWARF state
    EventLoop->>EventLoop: remove from known_tgids
    EventLoop->>Cache: remove(pid)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I nibble on execs and exits with delight,

Caches hop away at each new process sight,
eBPF drums the tune and ring buffers hum,
DWARF folds and reloads — fresh traces come,
The rabbit applauds: "Lifecycle tracking — yum!"

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main feature addition: eBPF process lifecycle tracking combined with a ProcessMetadataCache library API, which aligns with the primary objectives of the pull request.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch process_lifecycle

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Add eBPF-driven process exec/exit detection and a ProcessMetadataCache
to profile-bee's library API, enabling external profiling agents to
get process-aware profiling without polling /proc.

eBPF changes:
- sched_process_exec tracepoint program (detects execve() calls)
- lifecycle_tracking_map to broaden exit events for all processes
- process_exec_events ring buffer for exec notifications

Userspace:
- ProcessEvent unified type replaces ProcessExitEvent in PerfWork
- DwarfThreadMsg::ProcessExeced triggers DWARF table reload on exec
- ProcessMetadataCache: lazy /proc reader with eBPF-driven eviction
- PointerStackFramesCache::invalidate_pid() for exec cache flush
- SessionConfig::track_process_lifecycle (auto-on with DWARF)
- Graceful fallback for older eBPF binaries without new programs

Benefits for profile-bee core:
- Proactive DWARF table loading on exec (no 1s poll delay)
- Symbol cache invalidation on exec (no stale resolutions)
- Immediate DWARF shard reclamation on exit

97 tests pass (9 new for ProcessMetadataCache), clippy -D warnings clean.
@zz85 zz85 force-pushed the process_lifecycle branch from 3884f03 to 14cab1a Compare March 31, 2026 10:35

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
profile-bee-ebpf/src/lib.rs (1)

1429-1443: Consider rate-limiting under extreme fork workloads.

handle_process_exec fires for every execve() system-wide when lifecycle tracking is enabled. On systems with very high fork rates (e.g., build servers, CI runners), this could generate significant event volume.

Current behavior is acceptable for typical profiling use cases:

  • Ring buffer will drop events if full (graceful degradation)
  • Userspace async task drains efficiently
  • DWARF/cache invalidation is idempotent

For production agents on high-churn systems, consider documenting this trade-off or adding optional PID filtering.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@profile-bee-ebpf/src/lib.rs` around lines 1429 - 1443, handle_process_exec
currently emits a ProcessExecEvent for every execve system-wide (using
ctx.tgid() and RING_BUF_PROCESS_EXEC), which can overwhelm high-churn hosts; add
an optional PID filter or simple rate-limiting check inside handle_process_exec
to bail out early (e.g., consult a configurable allowlist/denylist or
token-bucket counter) before reserving/writing to RING_BUF_PROCESS_EXEC so
events for uninteresting PIDs or excessive frequency are skipped—update
ProcessExecEvent emission flow and any config hooks to support
enabling/disabling PID filtering at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@profile-bee/src/event_loop.rs`:
- Around line 246-253: The code evicts process metadata on PROCESS_EVENT_EXIT
(see PROCESS_EVENT_EXIT branch, DwarfThreadMsg::ProcessExited, self.known_tgids,
self.process_metadata), which loses data for processes that start+exit within
the same collection window; stop removing the PID from ProcessMetadataCache here
— either remove the cache.remove(event.pid) call and let collect()/collect_raw()
consume and evict snapshots, or eagerly populate the cache when a PID is first
seen so eviction on exit is safe; ensure whichever approach you pick updates the
consumer-visible lifecycle (e.g., add a post-collection eviction step or an
"evict after consumed" marker) rather than deleting inside the
PROCESS_EVENT_EXIT handler.
- Around line 261-274: When handling PROCESS_EVENT_EXEC in event_loop
(PROCESS_EVENT_EXEC branch) we invalidate symbol caches
(trace_handler.invalidate_caches_for_pid and process_metadata.invalidate) while
leaving pre-exec samples in trace_count, which causes build_raw_stacks() to
re-symbolize those pre-exec StackInfo against the new image; fix by flushing
this PID's accumulated samples from trace_count before invalidating caches or
else attach an exec-generation version to stored samples so build_raw_stacks()
can skip/resymbolize correctly. Concretely, update the PROCESS_EVENT_EXEC
handler to (a) remove or clear entries for event.pid from trace_count (the data
structure holding per-PID StackInfo counts) before calling
invalidate_caches_for_pid(event.pid) and process_metadata.invalidate(event.pid),
or (b) increment a per-PID exec-generation counter stored alongside known_tgids
and annotate new samples with that generation so build_raw_stacks()
ignores/keeps pre-exec samples tied to the older generation.

In `@profile-bee/src/process_metadata.rs`:
- Around line 118-136: get_or_load currently returns entries purely by pid,
which allows stale metadata to be reused when a PID is recycled; update
get_or_load in ProcessMetadataCache to validate the cached entry against a
stable process identity (e.g. the /proc/[pid]/stat starttime) before returning.
Concretely: when cache.contains_key(&pid) call a validation helper (e.g.
ProcessMetadata::current_start_time(pid) or a new method on ProcessMetadata) and
compare it to the stored entry’s start_time; if they differ, drop the cached
entry, call ProcessMetadata::load(pid) to reload and insert the new metadata,
and then return the fresh entry; otherwise return the cached reference as
before. Ensure this logic is applied in the get_or_load method and that
ProcessMetadata stores the start_time used for validation.

---

Nitpick comments:
In `@profile-bee-ebpf/src/lib.rs`:
- Around line 1429-1443: handle_process_exec currently emits a ProcessExecEvent
for every execve system-wide (using ctx.tgid() and RING_BUF_PROCESS_EXEC), which
can overwhelm high-churn hosts; add an optional PID filter or simple
rate-limiting check inside handle_process_exec to bail out early (e.g., consult
a configurable allowlist/denylist or token-bucket counter) before
reserving/writing to RING_BUF_PROCESS_EXEC so events for uninteresting PIDs or
excessive frequency are skipped—update ProcessExecEvent emission flow and any
config hooks to support enabling/disabling PID filtering at runtime.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 793c12eb-dd91-40dc-a8b4-e14b9344b97f

📥 Commits

Reviewing files that changed from the base of the PR and between 75b93ca and 14cab1a.

⛔ Files ignored due to path filters (2)
  • profile-bee-ebpf/Cargo.lock is excluded by !**/*.lock
  • profile-bee/ebpf-bin/profile-bee.bpf.o is excluded by !**/*.o
📒 Files selected for processing (13)
  • docs/custom_profiling_agent.md
  • profile-bee-common/src/lib.rs
  • profile-bee-ebpf/src/lib.rs
  • profile-bee-ebpf/src/main.rs
  • profile-bee/bin/profile-bee.rs
  • profile-bee/src/cache.rs
  • profile-bee/src/ebpf.rs
  • profile-bee/src/event_loop.rs
  • profile-bee/src/lib.rs
  • profile-bee/src/pipeline.rs
  • profile-bee/src/process_metadata.rs
  • profile-bee/src/session.rs
  • profile-bee/src/trace_handler.rs

Comment thread profile-bee/src/event_loop.rs
Comment thread profile-bee/src/event_loop.rs
Comment thread profile-bee/src/process_metadata.rs
…ct PID reuse

Three correctness fixes for the process lifecycle tracking:

1. Defer ProcessMetadataCache eviction with one-cycle buffer: exit PIDs
   are collected in pending_exit_pids during drain_events() and evicted
   at the START of the next drain_events() call. This gives agents one
   full collection cycle to read metadata for processes that exited,
   covering async ring buffer delivery where a sample may arrive after
   its process's exit event. evict_pending_exits() remains public for
   agents that need immediate eviction under memory pressure.

2. Flush trace_count on exec: pre-exec StackInfo entries were left in
   trace_count after invalidating symbol caches, causing build_raw_stacks()
   to symbolize old addresses against the new binary image (producing
   [unknown] symbols). Now trace_count.retain() removes entries for the
   exec'd PID before cache invalidation.

3. Detect PID reuse via starttime: get_or_load() now reads /proc/[pid]/stat
   starttime and compares it to the cached entry. If they differ, the PID
   was recycled and the stale entry is replaced with fresh metadata.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@profile-bee/src/event_loop.rs`:
- Around line 78-83: pending_exit_pids currently only stores pid (Vec<u32>),
which can match a recycled PID later; change it to store a process identity
(e.g., a small struct/tuple like ProcessIdentity { pid: u32, start_time_ns: u64
} or (u32, u64)) and update all places that push/read it (places around
evict_pending_exits(), where pending_exit_pids is populated during
drain_events()/build_raw_stacks(), and the other occurrences noted) so eviction
checks equality on both pid and start_time (or other unique start marker) rather
than pid alone; update types/signatures (pending_exit_pids, evict_pending_exits,
any iterators) and conversion points where you previously pushed a u32 to
instead capture the process start timestamp from process metadata and store the
identity.

In `@profile-bee/src/process_metadata.rs`:
- Around line 62-64: The current logic caches process entries even when
process.stat() fails (start_time == 0), allowing stale data to bypass PID-reuse
checks; update the loading/cache path (where process.stat() and start_time are
computed — e.g., the code that calls process.stat().ok() and assigns start_time)
to treat a missing stat as a load failure and do not insert the entry into the
cache when start_time == 0 (or when stat() returned None); ensure get_or_load()
and the alternative path at lines ~141-157 follow the same rule so only entries
with a valid start_time are persisted to the cache and PID-reuse detection
remains effective.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dc1f8222-7d4e-47ef-9bcb-8f613dcce67a

📥 Commits

Reviewing files that changed from the base of the PR and between 14cab1a and f77b7ac.

📒 Files selected for processing (2)
  • profile-bee/src/event_loop.rs
  • profile-bee/src/process_metadata.rs

Comment on lines +78 to 83
/// PIDs that exited during the current drain_events() call.
/// Eviction from `process_metadata` is deferred until after
/// `build_raw_stacks()` completes, so agents can still look up
/// metadata for processes that exited within the collection window.
pending_exit_pids: Vec<u32>,
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Deferred exit eviction should carry process identity, not only PID.

pending_exit_pids stores only pid, so if that PID is recycled before the next drain_events(), evict_pending_exits() can delete metadata for the new process instance.

🩹 Suggested fix (identity-aware deferred eviction)
-    pending_exit_pids: Vec<u32>,
+    pending_exit_pids: Vec<(u32, u64)>,
@@
-            for pid in self.pending_exit_pids.drain(..) {
-                cache.remove(pid);
+            for (pid, exited_start_time) in self.pending_exit_pids.drain(..) {
+                let should_remove = cache
+                    .get(pid)
+                    .map(|m| m.start_time == exited_start_time)
+                    .unwrap_or(false);
+                if should_remove {
+                    cache.remove(pid);
+                }
             }
@@
-                            if self.process_metadata.is_some() {
-                                self.pending_exit_pids.push(event.pid);
+                            if let Some(cache) = self.process_metadata.as_ref() {
+                                if let Some(meta) = cache.get(event.pid) {
+                                    self.pending_exit_pids.push((event.pid, meta.start_time));
+                                }
                             }

Also applies to: 173-177, 274-286

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@profile-bee/src/event_loop.rs` around lines 78 - 83, pending_exit_pids
currently only stores pid (Vec<u32>), which can match a recycled PID later;
change it to store a process identity (e.g., a small struct/tuple like
ProcessIdentity { pid: u32, start_time_ns: u64 } or (u32, u64)) and update all
places that push/read it (places around evict_pending_exits(), where
pending_exit_pids is populated during drain_events()/build_raw_stacks(), and the
other occurrences noted) so eviction checks equality on both pid and start_time
(or other unique start marker) rather than pid alone; update types/signatures
(pending_exit_pids, evict_pending_exits, any iterators) and conversion points
where you previously pushed a u32 to instead capture the process start timestamp
from process metadata and store the identity.

Comment thread profile-bee/src/process_metadata.rs Outdated
zz85 added 2 commits April 1, 2026 00:30
…urn)

sched_process_exit fires for every thread exit, not just process exits.
On thread-heavy hosts (Java, Go, Node.js), this generated thousands of
duplicate ring buffer events per second for the same tgid, each waking
the tokio AsyncFd task and pushing through the mpsc channel.

Add tid != tgid early return so only actual process exits (main thread)
produce events. sched_process_exec already only fires once per execve()
so no filter needed there.
…e entries

load() used map_or(0, ...) when stat() failed, caching entries with
start_time == 0. The get_or_load() validation skipped PID reuse checks
when cached_start == 0, so stale metadata was served indefinitely.

Now load() requires stat() to succeed (early return None on failure),
guaranteeing all cached entries have a valid start_time. Simplified
the get_or_load() validation — the cached_start != 0 guard is no
longer needed since load() enforces it.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
profile-bee/src/process_metadata.rs (1)

156-163: Consider opportunistic dead-entry pruning before hard-cap rejection.

When full, get_or_load() immediately returns None. With ProcessMetadataCache::new(4096) in profile-bee/src/event_loop.rs (Line 143), missed/delayed lifecycle events can make metadata lookup degrade for all new PIDs. A cheap prune pass on full-cache path would improve resilience.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@profile-bee/src/process_metadata.rs` around lines 156 - 163, In
get_or_load(), instead of immediately returning None when self.cache.len() >=
self.max_entries, perform a cheap opportunistic prune of dead/stale entries from
self.cache (e.g., remove entries whose metadata indicates the process exited or
whose last_seen timestamp is older than a short threshold), then re-check
capacity and proceed to insert; if the cache remains full after pruning, emit
the existing warn and return None. Update logic around the existing capacity
check in get_or_load() (the branch that uses self.max_entries) so the prune step
runs before the hard-cap rejection; keep pruning cheap and non-blocking to avoid
latency regressions (this complements ProcessMetadataCache::new(4096) usage in
event_loop.rs).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@profile-bee-ebpf/src/lib.rs`:
- Around line 75-79: The exec tracepoint handler emits events regardless of the
runtime gate; update handle_process_exec (and the other exec-emitting sites
noted) to check lifecycle_tracking_enabled() (which reads
LIFECYCLE_TRACKING_MAP) before allocating or writing exec notifications so
events are only produced when lifecycle tracking is enabled; bail out early from
handle_process_exec and the same exec emission paths when
lifecycle_tracking_enabled() returns false to avoid host-wide ring-buffer
pressure and cache churn.

In `@profile-bee/src/process_metadata.rs`:
- Around line 137-154: When a cached PID is found in get_or_load, current_start
== 0 (ProcessMetadata::current_start_time returning 0) must be treated as
"process gone" instead of returning stale data; change the branch in get_or_load
so that when self.cache.contains_key(&pid) and current_start == 0 you remove the
cached entry from self.cache and return None (rather than falling through to the
existing else that returns the stale entry), keeping the existing behavior for
the PID-reuse case when current_start != cached_start (remove and reload).

---

Nitpick comments:
In `@profile-bee/src/process_metadata.rs`:
- Around line 156-163: In get_or_load(), instead of immediately returning None
when self.cache.len() >= self.max_entries, perform a cheap opportunistic prune
of dead/stale entries from self.cache (e.g., remove entries whose metadata
indicates the process exited or whose last_seen timestamp is older than a short
threshold), then re-check capacity and proceed to insert; if the cache remains
full after pruning, emit the existing warn and return None. Update logic around
the existing capacity check in get_or_load() (the branch that uses
self.max_entries) so the prune step runs before the hard-cap rejection; keep
pruning cheap and non-blocking to avoid latency regressions (this complements
ProcessMetadataCache::new(4096) usage in event_loop.rs).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c6253d12-c345-4eb9-ad2a-8f15f4474075

📥 Commits

Reviewing files that changed from the base of the PR and between f77b7ac and 1cda34f.

⛔ Files ignored due to path filters (1)
  • profile-bee/ebpf-bin/profile-bee.bpf.o is excluded by !**/*.o
📒 Files selected for processing (2)
  • profile-bee-ebpf/src/lib.rs
  • profile-bee/src/process_metadata.rs

Comment on lines +75 to +79
/// Whether process lifecycle tracking is enabled (0 = disabled, 1 = enabled).
/// When enabled, exit events fire for ALL process exits (not just DWARF-tracked
/// or monitored PIDs), and the exec tracepoint sends exec events.
#[map(name = "lifecycle_tracking_map")]
static LIFECYCLE_TRACKING_MAP: Array<u32> = Array::with_max_entries(1, 0);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Gate exec notifications on lifecycle_tracking_enabled().

LIFECYCLE_TRACKING_MAP is documented as the runtime switch for exec notifications, but handle_process_exec() ignores it and emits on every attached exec. That creates avoidable host-wide ring-buffer pressure and cache churn when lifecycle tracking is off.

Suggested guard
 pub unsafe fn handle_process_exec<C: EbpfContext>(ctx: C) {
     use profile_bee_common::ProcessExecEvent;
 
+    if !lifecycle_tracking_enabled() {
+        return;
+    }
+
     let tgid = ctx.tgid();
 
     if let Some(mut entry) = RING_BUF_PROCESS_EXEC.reserve::<ProcessExecEvent>(0) {
         let exec_event = ProcessExecEvent { pid: tgid, _pad: 0 };
         let _writable = entry.write(exec_event);

Also applies to: 1448-1457

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@profile-bee-ebpf/src/lib.rs` around lines 75 - 79, The exec tracepoint
handler emits events regardless of the runtime gate; update handle_process_exec
(and the other exec-emitting sites noted) to check lifecycle_tracking_enabled()
(which reads LIFECYCLE_TRACKING_MAP) before allocating or writing exec
notifications so events are only produced when lifecycle tracking is enabled;
bail out early from handle_process_exec and the same exec emission paths when
lifecycle_tracking_enabled() returns false to avoid host-wide ring-buffer
pressure and cache churn.

Comment on lines +1413 to +1419
let tid = ctx.pid();
let tgid = ctx.tgid();

// Skip thread exits — only fire for the main thread (process exit).
if tid != tgid {
return;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

tid == tgid is not a reliable “process is gone” check.

The thread-group leader can exit via pthread_exit() while other threads in the same tgid keep running. This branch would emit a ProcessExitEvent early, which can stop --pid profiling and evict DWARF/process metadata for a still-live process, while the real last-thread exit may be filtered out later. Please key this off actual thread-group teardown, not leader-thread exit.

get_or_load() returned stale cached entries when current_start_time()
returned 0 (process exited, /proc/[pid]/stat unreadable). The condition
`current_start != 0 && current_start != cached_start` evaluated to
false, falling through to the else branch that returned the stale entry.

Now current_start == 0 is handled explicitly: remove the stale entry
and return None, since the process no longer exists.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
profile-bee/src/process_metadata.rs (1)

161-167: Don't warn on every miss after saturation.

Once the cache is full, every uncached PID will emit a warning. On a workload with lots of short-lived processes, that can turn into log spam from the sampling path. Consider logging once per “full” state and using a counter/metric for the rest.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@profile-bee/src/process_metadata.rs`:
- Around line 137-159: The public getter get() currently skips the PID/exit
revalidation that get_or_load() performs, allowing callers to observe stale
cmdline/exe/environ; either rename get() to get_cached_unchecked() (and update
callers) to make unchecked semantics explicit, or modify get() to perform the
same validation as get_or_load() (reuse ProcessMetadata::current_start_time(pid)
and the cache removal/reload logic) and then forward to get_or_load() or return
the validated cached entry; apply the same change for the analogous method
referenced around lines 177-180 so all public accessors either validate or are
explicitly named as unchecked.
- Around line 34-35: The derived Debug on ProcessMetadata can leak secrets via
the environ and cmdline fields; remove Debug from the derive on the
ProcessMetadata struct and add a manual impl of std::fmt::Debug for
ProcessMetadata that redacts or omits sensitive content (e.g., print
placeholders or counts like "<redacted environ (N entries)>" and "<redacted
cmdline (N args)>" while still showing non-sensitive fields), retain
#[derive(Clone)] for Clone, and implement the Debug impl using fmt::DebugStruct
(or fmt::Formatter) to format the other fields safely.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1d98863-9c8f-4497-9a10-d2bbaf08d092

📥 Commits

Reviewing files that changed from the base of the PR and between 1cda34f and 7d3ca02.

📒 Files selected for processing (1)
  • profile-bee/src/process_metadata.rs

Comment thread profile-bee/src/process_metadata.rs Outdated
Comment on lines +137 to +159
pub fn get_or_load(&mut self, pid: u32) -> Option<&ProcessMetadata> {
if self.cache.contains_key(&pid) {
// Validate against PID reuse and process exit: compare the cached
// start_time against the current /proc/[pid]/stat starttime.
let cached_start = self.cache[&pid].start_time;
let current_start = ProcessMetadata::current_start_time(pid);
if current_start == 0 {
// Process is gone (/proc/[pid]/stat unreadable) — remove stale entry.
self.cache.remove(&pid);
return None;
}
if current_start != cached_start {
tracing::debug!(
"PID {} recycled (starttime {} -> {}), reloading metadata",
pid,
cached_start,
current_start,
);
self.cache.remove(&pid);
// Fall through to reload below
} else {
return self.cache.get(&pid);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make the unchecked get() semantics explicit.

get_or_load() does the PID/exit revalidation, but get() skips it entirely. Because the validated path still reopens /proc/[pid]/stat on cache hits, callers on a hot path are likely to reach for get() and can then read stale cmdline/exe/environ after a missed lifecycle event. I'd either rename this to something like get_cached_unchecked() or make the public getter validated as well.

Also applies to: 177-180

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@profile-bee/src/process_metadata.rs` around lines 137 - 159, The public
getter get() currently skips the PID/exit revalidation that get_or_load()
performs, allowing callers to observe stale cmdline/exe/environ; either rename
get() to get_cached_unchecked() (and update callers) to make unchecked semantics
explicit, or modify get() to perform the same validation as get_or_load() (reuse
ProcessMetadata::current_start_time(pid) and the cache removal/reload logic) and
then forward to get_or_load() or return the validated cached entry; apply the
same change for the analogous method referenced around lines 177-180 so all
public accessors either validate or are explicitly named as unchecked.

Replace derived Debug with manual impl that shows entry counts instead
of values ('<N args>', '<N vars>') to prevent secrets like
AWS_SECRET_ACCESS_KEY from leaking into logs via {:?} formatting.
@zz85 zz85 merged commit dcc0294 into main Apr 1, 2026
3 of 4 checks passed
@zz85 zz85 deleted the process_lifecycle branch April 1, 2026 05:33
@coderabbitai coderabbitai Bot mentioned this pull request Apr 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant