fix: resolve startup handshake timeout crash + bound telemetry directory growth #200
Merged
The --hub-only child is always spawned by a parent that ran the same cleanup_bash_log_orphans() seconds earlier. That scan reads and parses every session.json under ~/.loopal/sessions; on a large tree it exceeds the 30s hub handshake timeout (hub_spawn::HANDSHAKE_TIMEOUT), so the parent kills the child as an unreachable orphan and startup crashes with "hub child did not produce a handshake within 30s". Skip the orphan scan in the hub-only branch; the spawning parent already ran it. Every other entry (incl. serve / acp daemons) still owns sessions and keeps running it to bound tmp growth.
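The branch decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/bootstrap/mod.rs: the `EntryMode` enum, `should_scan_orphans`, and `bootstrap` are hypothetical names, and the `scan` closure merely stands in for `cleanup_bash_log_orphans()`.

```rust
/// Hypothetical entry modes; the real CLI surface is not shown in this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EntryMode {
    Interactive,
    Serve,
    Acp,
    HubOnly,
}

/// Every entry except the hub-only child owns sessions and should scan.
/// The hub-only child is skipped because its parent scanned moments ago.
pub fn should_scan_orphans(mode: EntryMode) -> bool {
    !matches!(mode, EntryMode::HubOnly)
}

/// Runs startup housekeeping and reports whether the scan was executed.
/// `scan` is a stand-in for cleanup_bash_log_orphans() (assumption).
pub fn bootstrap<F: FnOnce()>(mode: EntryMode, scan: F) -> bool {
    if should_scan_orphans(mode) {
        scan();
        true
    } else {
        // Skipping keeps the child well inside the handshake timeout.
        false
    }
}
```

Keeping the decision in a small pure function makes the skip easy to unit-test without touching the session tree.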
The traces-/metrics- JSONL exporters created one file pair per process with no retention, so ~/.loopal/telemetry grew monotonically across the multiprocess fleet (observed: 55k files, 62% empty). Unlike logs/ and tmp/, the dir had no housekeeping. Two complementary defenses:
- retention: cleanup_telemetry_files enforces a per-kind file-count and byte-size ceiling at init_logging, alive-PID-safe and prefix-whitelisted (only traces-/metrics-, never touches secret_access.jsonl). Mirrors the existing log_writer::cleanup_old_logs pattern.
- lazy-create: LazyJsonlFile defers file creation to the first write, so short-lived processes that emit no spans/metrics stop leaving empty files. new() still calls create_dir_all(dir) to preserve the bad-dir warning.
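To illustrate the lazy-create idea, here is a minimal sketch of a sink that creates its directory eagerly but its file only on the first write. It is an assumption-laden illustration rather than the actual `LazyJsonlFile` from this PR: the constructor signature, `write_line`, and `path` are hypothetical.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Sketch of a JSONL sink that defers file creation to the first write.
pub struct LazyJsonlFile {
    path: PathBuf,
    file: Option<File>,
}

impl LazyJsonlFile {
    /// Creates the directory eagerly so a bad directory is reported here,
    /// but leaves the file itself uncreated.
    pub fn new(dir: &Path, file_name: &str) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { path: dir.join(file_name), file: None })
    }

    /// Appends one record plus a newline, opening the file on first use.
    pub fn write_line(&mut self, line: &str) -> io::Result<()> {
        if self.file.is_none() {
            let opened = OpenOptions::new()
                .create(true)
                .append(true)
                .open(&self.path)?;
            self.file = Some(opened);
        }
        let file = self.file.as_mut().expect("file was opened just above");
        file.write_all(line.as_bytes())?;
        file.write_all(b"\n")
    }

    /// Path the sink will write to (the file may not exist yet).
    pub fn path(&self) -> &Path {
        &self.path
    }
}
```

A process that never calls `write_line` leaves nothing behind except the directory, which is the property that stops short-lived processes from producing empty files.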
Summary
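- Fix the loopal startup crash `hub child did not produce a handshake within 30s`: the hub-only child no longer repeats the full session scan that the parent has just run.
- Add a hard retention ceiling to `~/.loopal/telemetry` and switch the exporters to lazy-create, fixing the directory's unbounded growth at the root (observed to self-heal and converge from 55k files).

Root cause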
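At startup, `cleanup_bash_log_orphans()` → `list_sessions()` walks and parses every session.json under `~/.loopal/sessions`. On a large tree (observed: 37k directories / 20k sessions) a cold start takes more than 30s. The scan ran once in the parent and once more in the `--hub-only` child, so the child timed out and was killed as an orphan. A problem of the same kind: every process left a `traces-`/`metrics-` file pair in the telemetry directory with no cleanup at all, accumulating to 55k files (62% of them empty).

Changes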
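- `src/bootstrap/mod.rs`: the hub-only branch skips the redundant orphan scan (the parent has already run it)
- `crates/loopal-telemetry/src/retention.rs` (new): `cleanup_telemetry_files` enforces per-kind file-count and byte-size ceilings, is alive-PID-safe, and uses a prefix whitelist (never touches `secret_access.jsonl`)
- `crates/loopal-telemetry/src/lazy_jsonl.rs` (new): `LazyJsonlFile` creates the file only on the first write
- `file_span_exporter.rs` / `file_metric_exporter.rs`: switched to the lazy sink; `new()` keeps `create_dir_all` to preserve the bad-directory warning contract
- `src/logging.rs`: calls the telemetry cleanup in `init_logging` (right next to the existing `cleanup_old_logs`)

Test plan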
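- `bazel test //crates/loopal-telemetry` passes (6 new unit tests + existing tests)
- `bazel build //:loopal --config=clippy`: zero warnings
- `Hub listening`; telemetry 55k → 270 and still converging; short-lived multiprocess sessions no longer leave empty files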
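For readers who want a feel for how a retention pass of this shape can converge a directory, the following is a rough sketch of the selection logic only. It is not the code in retention.rs: the `Entry` struct, `plan_deletions`, the newest-first ordering, and the idea that a PID can be recovered per file are all assumptions made for illustration. Liveness is injected as a closure so the sketch stays platform-independent.

```rust
/// Metadata for one file in the telemetry directory (hypothetical shape).
#[derive(Clone, Debug)]
pub struct Entry {
    pub name: String,
    pub bytes: u64,
    /// Modification time in seconds since the epoch (larger = newer).
    pub mtime: u64,
    /// PID of the writing process, if it can be determined (assumption).
    pub pid: Option<u32>,
}

/// Only files with these prefixes are ever eligible for deletion.
const PREFIXES: [&str; 2] = ["traces-", "metrics-"];

/// Returns the names to delete so that each kind stays within both budgets.
/// Files owned by a live process are always kept and count toward the budgets.
pub fn plan_deletions(
    entries: &[Entry],
    max_files: usize,
    max_bytes: u64,
    is_alive: impl Fn(u32) -> bool,
) -> Vec<String> {
    let mut doomed = Vec::new();
    for prefix in PREFIXES {
        let mut kind: Vec<&Entry> = entries
            .iter()
            .filter(|e| e.name.starts_with(prefix))
            .collect();
        // Visit newest first so the most recent files claim the budget.
        kind.sort_by(|a, b| b.mtime.cmp(&a.mtime));

        let mut kept_files = 0usize;
        let mut kept_bytes = 0u64;
        for e in kind {
            let protected = e.pid.map_or(false, |pid| is_alive(pid));
            let fits = kept_files < max_files
                && kept_bytes.saturating_add(e.bytes) <= max_bytes;
            if protected || fits {
                kept_files += 1;
                kept_bytes = kept_bytes.saturating_add(e.bytes);
            } else {
                doomed.push(e.name.clone());
            }
        }
    }
    doomed
}
```

Because files that match neither prefix are never even considered, a file such as `secret_access.jsonl` cannot end up in the deletion list regardless of the budgets.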