fix: resolve startup handshake-timeout crash + bound the unbounded growth of the telemetry directory #200

Merged
yishuiliunian merged 2 commits into main from fix/startup-hang-and-telemetry-leak on Jun 16, 2026

Conversation

@yishuiliunian

Contributor

Summary

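  • Fix the loopal startup crash "hub child did not produce a handshake within 30s": the hub-only child process no longer repeats the full session scan that the parent process has just run
  • Add a hard retention cap to ~/.loopal/telemetry and switch the exporters to lazy file creation, fixing the unbounded directory growth at its root (in testing, a directory of 55k files converged back down on its own)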

Root cause

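At startup, cleanup_bash_log_orphans() → list_sessions() walks and parses every session.json under ~/.loopal/sessions. On a large tree (measured: 37k directories / 20k sessions) a cold start takes >30s. The scan runs once in the parent process and once more in the --hub-only child, so the child times out and is killed as an orphan. A problem with the same origin: every process leaves a traces-/metrics- file pair in the telemetry directory and nothing ever cleans them up, so they accumulated to 55k files (62% of them empty).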

Changes

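  • src/bootstrap/mod.rs: the hub-only branch skips the redundant orphan scan (the parent process has already run it)
  • crates/loopal-telemetry/src/retention.rs (new): cleanup_telemetry_files enforces a dual cap per file kind (file count and total size), is alive-PID-safe, and uses a prefix whitelist (it never touches secret_access.jsonl)
  • crates/loopal-telemetry/src/lazy_jsonl.rs (new): LazyJsonlFile creates the file only on the first write
  • file_span_exporter.rs / file_metric_exporter.rs: switched to the lazy sink; new() keeps create_dir_all to preserve the bad-directory warning contract
  • src/logging.rs: init_logging now calls the telemetry cleanup (right next to the existing cleanup_old_logs)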
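The dual cap in retention.rs can be pictured with a small pure function. This is only a sketch of the idea, not the code in this PR: `TelemetryFile`, `file`, `select_for_deletion`, and the `owner_alive` flag are hypothetical names, and how the real cleanup_telemetry_files decides that a PID is alive or orders candidates is not stated here. The sketch assumes newest files are kept first, files owned by a live process are never deleted, and only names matching the given prefix are considered.

```rust
/// Hypothetical metadata for one file in the telemetry directory.
#[derive(Debug, Clone)]
pub struct TelemetryFile {
    pub name: String,
    pub size: u64,
    /// Modification time as seconds since some epoch (assumed representation).
    pub modified_secs: u64,
    /// Whether the process that wrote the file is still running (assumed input).
    pub owner_alive: bool,
}

/// Convenience constructor for the sketch.
pub fn file(name: &str, size: u64, modified_secs: u64, owner_alive: bool) -> TelemetryFile {
    TelemetryFile { name: name.to_string(), size, modified_secs, owner_alive }
}

/// Returns the names to delete for one kind of file (for example "traces-").
/// Files not starting with `prefix` are never returned, which is what keeps
/// unrelated files such as secret_access.jsonl out of reach.
pub fn select_for_deletion(
    files: &[TelemetryFile],
    prefix: &str,
    max_files: usize,
    max_bytes: u64,
) -> Vec<String> {
    // Prefix whitelist: only this kind of file is a candidate.
    let mut kind: Vec<&TelemetryFile> =
        files.iter().filter(|f| f.name.starts_with(prefix)).collect();
    // Newest first, so the budget is spent on the most recent files.
    kind.sort_by(|a, b| b.modified_secs.cmp(&a.modified_secs));

    let mut kept_files = 0usize;
    let mut kept_bytes = 0u64;
    let mut doomed = Vec::new();
    for f in kind {
        let over_budget =
            kept_files >= max_files || kept_bytes.saturating_add(f.size) > max_bytes;
        if over_budget && !f.owner_alive {
            doomed.push(f.name.clone());
        } else {
            // Kept files, including live ones that exceed the budget, count toward it.
            kept_files += 1;
            kept_bytes = kept_bytes.saturating_add(f.size);
        }
    }
    doomed
}
```

Running the selection once per prefix ("traces-", then "metrics-") gives each kind its own ceiling, and a caller would then remove the returned names from disk.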

Test plan

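  • bazel test //crates/loopal-telemetry passes (6 new unit tests plus the existing tests)
  • bazel build //:loopal --config=clippy produces zero warnings
  • End to end: under GTMApps, startup goes from timing out to "Hub listening" in 2.5s; telemetry drops from 55k → 270 files and keeps converging; short-lived multi-process sessions no longer leave empty files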
  • CI passes

The --hub-only child is always spawned by a parent that ran the same
cleanup_bash_log_orphans() seconds earlier. That scan reads and parses
every session.json under ~/.loopal/sessions; on a large tree it exceeds
the 30s hub handshake timeout (hub_spawn::HANDSHAKE_TIMEOUT), so the
parent kills the child as an unreachable orphan and startup crashes with
"hub child did not produce a handshake within 30s".

Skip the orphan scan in the hub-only branch; the spawning parent already
ran it. Every other entry point (including the serve / acp daemons) still
owns sessions and keeps running it to bound tmp growth.

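A minimal sketch of the branch described above, under stated assumptions: `StartupMode`, `should_scan_orphans`, and `run_startup_cleanup` are hypothetical names for illustration, and the actual shape of src/bootstrap/mod.rs is not shown in this PR description. The scan itself is passed in as a closure so the decision can be exercised without touching ~/.loopal/sessions.

```rust
/// Hypothetical set of entry points; the real bootstrap code may model this differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartupMode {
    /// Child spawned with --hub-only by a parent that already ran the scan.
    HubOnly,
    Serve,
    Acp,
    Interactive,
}

/// Only the hub-only child skips the scan; every other entry point owns
/// sessions and still has to bound tmp growth.
pub fn should_scan_orphans(mode: StartupMode) -> bool {
    !matches!(mode, StartupMode::HubOnly)
}

/// Runs `scan` (standing in for cleanup_bash_log_orphans) when the mode
/// requires it and reports whether it ran.
pub fn run_startup_cleanup(mode: StartupMode, scan: impl FnOnce()) -> bool {
    let ran = should_scan_orphans(mode);
    if ran {
        scan();
    }
    ran
}
```

Keeping the decision in one small predicate means the hub-only child reaches its handshake without paying for a scan the parent finished moments earlier.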
The traces-/metrics- JSONL exporters created one file pair per process
with no retention, so ~/.loopal/telemetry grew monotonically across the
multiprocess fleet (observed: 55k files, 62% empty). Unlike logs/ and
tmp/, the dir had no housekeeping.

Two complementary defenses:
- retention: cleanup_telemetry_files enforces a per-kind file-count and
  byte-size ceiling at init_logging, alive-PID-safe and prefix-whitelisted
  (only traces-/metrics-, never touches secret_access.jsonl). Mirrors the
  existing log_writer::cleanup_old_logs pattern.
- lazy-create: LazyJsonlFile defers file creation to the first write, so
  short-lived processes that emit no spans/metrics stop leaving empty
  files. new() still calls create_dir_all(dir) to preserve the bad-dir warning.
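The lazy-create defense could look roughly like the following. This is a sketch built only on the standard library; the field names, the `write_line` method, and the `path` accessor are assumptions, and the real LazyJsonlFile in crates/loopal-telemetry/src/lazy_jsonl.rs may differ. It shows the two properties named above: the directory is created eagerly in new(), and the file is opened only when the first line is written.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// JSONL sink that does not create its file until something is written.
pub struct LazyJsonlFile {
    path: PathBuf,
    file: Option<File>,
}

impl LazyJsonlFile {
    /// Creates the directory right away so an unusable directory is still
    /// reported at construction time, but does not create the file.
    pub fn new(dir: &Path, file_name: &str) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { path: dir.join(file_name), file: None })
    }

    /// Appends one JSON document followed by a newline, opening the file
    /// on the first call.
    pub fn write_line(&mut self, json: &str) -> io::Result<()> {
        if self.file.is_none() {
            let opened = OpenOptions::new().create(true).append(true).open(&self.path)?;
            self.file = Some(opened);
        }
        let file = self.file.as_mut().expect("file was opened above");
        file.write_all(json.as_bytes())?;
        file.write_all(b"\n")
    }

    /// Path the sink writes to; the file exists only after the first write.
    pub fn path(&self) -> &Path {
        &self.path
    }
}
```

A process that constructs the sink and exits without emitting anything leaves the directory untouched apart from its existence, which is the behavior that stops empty files from piling up.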
@yishuiliunian yishuiliunian merged commit 153d02c into main Jun 16, 2026
4 checks passed
@yishuiliunian yishuiliunian deleted the fix/startup-hang-and-telemetry-leak branch June 16, 2026 08:36