Rebase fork onto upstream tokio 1.52.3 (1.52.7003+anthropic) #1
Draft
rpb-ant wants to merge 13 commits into
Add a `stall-detection` feature that detects when a tokio worker thread
is stalled (blocked in a task poll for too long) and reports diagnostics
including stack traces.
How it works:
- Each worker maintains a generation counter (local increment + Release
store to a per-worker AtomicU64 in WorkerMetrics). Odd = currently
polling a task, even = idle.
- A background monitor thread polls these counters periodically. If a
counter has the same odd value on successive polls, that worker is
stalled.
- On detection, captures a user-space stack trace by sending a realtime
signal to the stalled thread. The signal handler uses a fully async-
signal-safe frame-pointer walker (no locks, no heap allocation, no
dl_iterate_phdr). Also captures the kernel stack via procfs.
- Reports stall events via tracing (warn for resolved, error for
ongoing). Optional on_stall callback for programmatic access.
- Signal number is chosen dynamically by probing for free realtime
signals, avoiding conflicts with other signal-based tools.
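The counter protocol described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the fork's code: `WorkerGen`, `poll_task`, and `is_stalled` are hypothetical names, and the `Acquire` load on the monitor side is an assumption (the commit message only specifies the `Release` stores).

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the per-worker counter (the fork keeps it in WorkerMetrics).
#[derive(Default)]
struct WorkerGen {
    generation: AtomicU64,
}

impl WorkerGen {
    /// Runs one task poll: the counter is odd while the task runs, even once it returns.
    fn poll_task<F: FnOnce()>(&self, local: &mut u64, task: F) {
        *local += 1; // odd: polling
        self.generation.store(*local, Ordering::Release);
        task();
        *local += 1; // even: idle
        self.generation.store(*local, Ordering::Release);
    }
}

/// A worker counts as stalled when two successive samples show the same odd value.
fn is_stalled(prev: u64, curr: u64) -> bool {
    curr % 2 == 1 && prev == curr
}

fn main() {
    let worker = Arc::new(WorkerGen::default());
    let w = Arc::clone(&worker);
    let handle = thread::spawn(move || {
        let mut local = 0u64;
        w.poll_task(&mut local, || {}); // a fast poll
        w.poll_task(&mut local, || thread::sleep(Duration::from_millis(300))); // a blocking poll
    });

    // Monitor loop: sample the counter periodically and compare successive samples.
    let mut prev = worker.generation.load(Ordering::Acquire);
    let mut detected = false;
    for _ in 0..20 {
        thread::sleep(Duration::from_millis(50));
        let curr = worker.generation.load(Ordering::Acquire);
        if is_stalled(prev, curr) {
            detected = true;
            break;
        }
        prev = curr;
    }
    handle.join().unwrap();
    println!("stall detected: {detected}");
}
```

Keeping the increment in a worker-local variable means the hot path only performs plain stores, which is consistent with the two-stores-per-poll cost quoted under "Performance overhead".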
Usage:
    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_stall_detection()
        .stall_detection_poll_interval(Duration::from_millis(100))
        .stall_detection_escalation_threshold(Duration::from_secs(10))
        .build()
        .unwrap();
Performance overhead:
- Generation counter: ~2ns per task poll on x86 (two Release stores
to an uncontended, cache-line-aligned atomic). ~8ns on ARM.
- Monitor thread: one std::thread doing periodic reads of N atomics.
- Signal-based trace capture: only fires when a stall is detected.
- Zero overhead when feature is not enabled.
The GitHub repo has an existing 'publish' environment. Update the workflow to reference it instead of 'publish-cli'. The environment name must match the OIDC subject claim in the Terraform config.
Moves `Builder::rng_seed`, the `runtime::RngSeed` re-export, and `RngSeed::from_bytes` out of `cfg_unstable!` so callers can seed the scheduler RNG (and thus get deterministic `select!` branch ordering) without `--cfg tokio_unstable`. The underlying `seed_generator` field and all runtime plumbing were already unconditional; only the public setter and type re-export were gated. No behavior change for builds that already set `tokio_unstable`. Upstream tracking issue: tokio-rs#4879 (introduced in tokio-rs#4910).
Cargo ignores build metadata for version identity, so 1.49.0+anthropic.2 collides with the already-published 1.49.0+anthropic.1 and the publish is rejected. Move the release counter into the patch number using P = N * 1000 + upstream_patch so each fork release is a distinct semver version that never overlaps an upstream patch number. Keep +anthropic as a constant build-metadata marker. :house: Remote-Dev: homespace
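The patch-offset arithmetic can be checked with a few lines of Rust. This is a minimal sketch: the helper names `fork_patch`, `fork_version`, and `identity` are hypothetical, and the sample inputs (N=7, upstream_patch=3) are the ones this PR uses for its release.

```rust
/// Fork patch number under the scheme P = N * 1000 + upstream_patch.
fn fork_patch(release_counter: u64, upstream_patch: u64) -> u64 {
    // The scheme only stays unambiguous while the upstream patch is below 1000.
    assert!(upstream_patch < 1000, "upstream patch must stay below 1000");
    release_counter * 1000 + upstream_patch
}

/// Full version string with the constant `+anthropic` build-metadata marker.
fn fork_version(major: u64, minor: u64, release_counter: u64, upstream_patch: u64) -> String {
    format!(
        "{major}.{minor}.{}+anthropic",
        fork_patch(release_counter, upstream_patch)
    )
}

/// The part of a version string that determines identity once build metadata is ignored.
fn identity(version: &str) -> &str {
    version.split('+').next().unwrap_or(version)
}

fn main() {
    // Old scheme: the two releases differ only in build metadata, so they collide.
    assert_eq!(identity("1.49.0+anthropic.1"), identity("1.49.0+anthropic.2"));

    // New scheme: the release counter lives in the patch number.
    println!("{}", fork_version(1, 52, 7, 3));
}
```

Because N starts at 1, every fork patch number is at least 1000, so it stays clear of upstream patch numbers as long as upstream never reaches patch 1000 within a minor series.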
Add `symbolicated_frames: Vec<String>` to `StallInfo` so a callback can forward the resolved frames (e.g. to Sentry) without re-running `backtrace::resolve` on the same IPs the monitor thread already symbolicated for the log line. The monitor was already symbolicating inside `format_trace`. Pull that call up into `emit_resolved`/`emit_escalation`, pass the slice to `format_trace`, and move the vec into the `StallInfo` handed to the callback. Log output is unchanged. A non-Linux stub of `symbolicate_trace` keeps the call site portable.
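The "symbolicate once, borrow for the log line, move into the callback" flow might look roughly like the sketch below. Everything except the field name `symbolicated_frames` and the function names mentioned in the commit message is an assumption: the trimmed-down `StallInfo`, the function signatures, and the fake symbolication (which only formats the raw addresses) are illustrative stand-ins.

```rust
/// Hypothetical, trimmed-down StallInfo; the real struct carries more fields.
#[derive(Debug)]
struct StallInfo {
    worker: usize,
    symbolicated_frames: Vec<String>,
}

/// Stand-in for symbolication: a real implementation would resolve symbol names.
fn symbolicate_trace(ips: &[usize]) -> Vec<String> {
    ips.iter().map(|ip| format!("frame@{ip:#x}")).collect()
}

/// Builds the log text from frames that were already resolved by the caller.
fn format_trace(frames: &[String]) -> String {
    frames.join("\n")
}

/// Resolves the frames once, borrows them for the log line, then moves them into the callback.
fn emit_resolved(worker: usize, ips: &[usize], on_stall: &mut dyn FnMut(StallInfo)) -> String {
    let frames = symbolicate_trace(ips); // the only symbolication pass
    let log_line = format_trace(&frames); // borrowed: no second resolve
    on_stall(StallInfo {
        worker,
        symbolicated_frames: frames, // moved: no clone
    });
    log_line
}

fn main() {
    let mut forwarded = Vec::new();
    let log = emit_resolved(3, &[0x1000, 0x2000], &mut |info: StallInfo| {
        println!("worker {} stalled", info.worker);
        forwarded = info.symbolicated_frames; // e.g. hand these to an error reporter
    });
    println!("{log}");
    println!("{} frames forwarded", forwarded.len());
}
```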
A process running multiple tokio runtimes (each with its own `Builder::thread_name(...)` prefix) currently emits stall events that downstream can't tell apart -- the `worker` index is reused across runtimes. Surface the configured thread name so log/Sentry/metrics consumers can group stalls by runtime via the thread-name prefix. Capture each worker's `std::thread::current().name()` at startup into a new `WorkerMetrics::thread_name: OnceLock<String>` and expose it via `RuntimeMetrics::worker_thread_name(idx)`. The stall monitor reads the name when it first detects a stall, stores it alongside the stack trace and blocking-pool snapshot, and includes it on both: - the `tracing::warn!`/`error!` event as a `thread_name` structured field (the message templates are unchanged so issue-grouping by template stays stable), - the `StallInfo::thread_name` handed to the `on_stall` callback. This is the full untruncated thread name -- not the kernel's 15-byte `comm` -- so a configured prefix survives intact and works on non-Linux too.
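The capture-once pattern can be sketched with `std::sync::OnceLock`. The `WorkerMetrics` struct here is a hypothetical stand-in with a single field, and `record_thread_name` and `worker_thread_name` are illustrative free functions, not the fork's actual methods.

```rust
use std::sync::{Arc, OnceLock};
use std::thread;

/// Hypothetical stand-in for the per-worker metrics struct.
#[derive(Default)]
struct WorkerMetrics {
    thread_name: OnceLock<String>,
}

/// Records the current thread's name; the first successful call wins, later calls are no-ops.
fn record_thread_name(metrics: &WorkerMetrics) {
    if let Some(name) = thread::current().name() {
        let _ = metrics.thread_name.set(name.to_owned());
    }
}

/// Returns the recorded name, or None if the worker never ran the capture step.
fn worker_thread_name(metrics: &WorkerMetrics) -> Option<&str> {
    metrics.thread_name.get().map(String::as_str)
}

fn main() {
    let metrics = Arc::new(WorkerMetrics::default());
    // Before the worker thread has started, nothing is recorded.
    assert_eq!(worker_thread_name(&metrics), None);

    let m = Arc::clone(&metrics);
    thread::Builder::new()
        // Longer than 15 bytes on purpose: the Rust-level name is not truncated.
        .name("ingest-runtime-worker-0".to_string())
        .spawn(move || record_thread_name(&m))
        .unwrap()
        .join()
        .unwrap();

    println!("{:?}", worker_thread_name(&metrics));
}
```

Since `OnceLock::set` ignores every call after the first, calling the capture step from more than one startup site is harmless; a site that never calls it leaves the value unset and readers see `None`.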
…tup path The thread-name capture added in anthropics#14 was wired into two of the three per-worker startup sites that record os_thread_id, but missed the primary one: `fn run(worker)` in multi_thread/worker.rs, which every multi-thread worker passes through exactly once when its thread starts. The other multi-thread site only fires on a core-reacquire handoff (e.g. after `block_in_place`), which a worker that stalls on its first task never hits. `WorkerMetrics::thread_name` is a `OnceLock`, so a worker that reaches its first stall without `fn run()` having set it reports `thread_name = None` -- exactly the case the field was added to cover. Add the same `thread::current().name()` -> `metrics.thread_name.set()` block to `fn run()`. Also add a test that installs an `on_stall` callback, configures a `Builder::thread_name`, triggers a stall, and asserts the callback receives the configured name -- this fails without the fix.
The fork delta (stall-detection feature, Builder::rng_seed stabilization, publish workflow) is unchanged: 23 files, +2150/-47, identical to the delta carried on anthropic-1.49.0. Upstream 1.49.0 -> 1.52.3 brings ~5 months of fixes; the two scheduler changes that touched our hook points (LIFO-slot stealing, sharded spawn_blocking queue) were both reverted upstream before 1.52.3, so the runtime structure the fork instruments is unchanged. Publish trigger and docs now reference the anthropic-1.52.3 branch. The GitHub 'publish' environment branch restriction must be updated in repo settings to match (cannot be done in code). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Note
Why this PR is in `rpb-ant/tokio` rather than `anthropics/tokio`: the fork repo is push-restricted (rpb has pull-only access), so the review surface lives here. The base branch `upstream-tokio-1.52.3` is the pristine upstream tag, so the diff below is exactly the fork's 13-commit delta and nothing else. Once blessed, a maintainer pushes `anthropic-1.52.3` to `anthropics/tokio` (commands in the "Maintainer actions" section at the bottom).
Rebase fork onto upstream tokio 1.52.3 → `1.52.7003+anthropic`
Why now
The anthropic monorepo needs starlark 0.14 (Meta's Buck2 Starlark interpreter, for the starbuf typed-config project, PR #539931). starlark 0.14 has a mandatory dependency on Meta's `pagable` persistence crate, which requires `tokio ^1.52.1`, a requirement our fork, based on upstream 1.49.0, cannot satisfy. Since `[patch.crates-io]` makes the fork the only tokio in the workspace, the whole monorepo is capped at 1.49 until the fork moves.
(The `^1.52.1` pin is itself an artifact of Meta's release tooling: pagable only uses tokio-1.0-era APIs, but their pins will always track their current tokio, so keeping the fork reasonably fresh is the durable fix. Recon with the full analysis: anthropic monorepo → `~rpb/foo/starbuf-antline/notes/2026-06-01-tokio-fork-recon.md`.)
This also picks up ~5 months of upstream fixes (mpsc bugs, a spawn_blocking hang fix, RwLock fixes, io-uring work).
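To make the version constraint concrete, here is a small sketch of caret matching for requirements with a major version of 1 or higher. It is a simplification under stated assumptions: it ignores pre-release tags, declines to handle `0.x` requirements (whose caret rules differ), and the `1.49.1000+anthropic` input is only an illustrative 1.49-based fork version.

```rust
/// Parses "major.minor.patch", ignoring any "+build" metadata.
fn parse(version: &str) -> Option<(u64, u64, u64)> {
    let core = version.split('+').next()?;
    let mut parts = core.split('.').map(|p| p.parse::<u64>().ok());
    let parsed = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some(parsed)
}

/// Caret match for requirements with major >= 1:
/// same major version, and (minor, patch) at least the required one.
fn caret_matches(req: &str, version: &str) -> Option<bool> {
    let (req_major, req_minor, req_patch) = parse(req.trim_start_matches('^'))?;
    let (major, minor, patch) = parse(version)?;
    if req_major == 0 {
        return None; // 0.x caret rules differ; out of scope for this sketch
    }
    Some(major == req_major && (minor, patch) >= (req_minor, req_patch))
}

fn main() {
    // A 1.49-based fork version cannot satisfy the requirement...
    println!("{:?}", caret_matches("^1.52.1", "1.49.1000+anthropic"));
    // ...while the rebased fork version can.
    println!("{:?}", caret_matches("^1.52.1", "1.52.7003+anthropic"));
}
```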
What this branch is
`anthropic-1.52.3`, cut from upstream tag `tokio-1.52.3`, carrying the identical fork delta as `anthropic-1.49.0`:
(njs ×5, sujay ×2, edwin ×5; the two empty "ci: trigger publish retry" commits were dropped), plus one release commit (version + publish-trigger/doc updates).
difference is a merged TOML section header, see conflicts below).
`1.52.7003+anthropic` (N=7 continues the monotonic counter per the patch-offset scheme; upstream_patch=3).
Conflicts hit during the rebase (4 total, all trivial)
- `examples/Cargo.toml`: `prewarm-fd-table` example where the fork adds its `stall_detection` example; `[[example]]` entries
- `examples/Cargo.toml` (semantic, found by build): `[target.'cfg(target_os = "linux")'.dev-dependencies]` section; the fork's commit also added one → duplicate TOML key; `tokio = {..., "stall-detection"}` dev-dep into upstream's existing section
- `tokio/src/runtime/builder.rs` ×2: `enable_eager_driver_handoff` field/initializer at the same spot as the fork's `stall_detection_config`
- `tokio/Cargo.toml` (×7 commits): `1.49.N000` → `1.52.N003`)
Why the risk is low
The two upstream changes in the 1.49→1.52 window that touched the fork's hook points, LIFO-slot stealing (1.51.0 tokio-rs#7431) and the sharded spawn_blocking queue (1.52.0 tokio-rs#7757), were both reverted upstream (1.52.2 tokio-rs#8100 / 1.52.1 tokio-rs#8057). The scheduler structure the stall-detection feature instruments is unchanged at 1.52.3. The files the fork extends most (`metrics/worker.rs`, `metrics/runtime.rs`, `blocking/mod.rs`, `util/rand.rs`) are byte-identical between 1.49.0 and 1.52.3; `multi_thread/worker.rs` auto-merged.
Test results (this branch, locally)
- `cargo build --features full,test-util,tracing,stall-detection`
- `tests/rt_stall_detection.rs` (the fork's own feature)
- `rt_threaded` + `rt_common` (scheduler/conflict areas)
- `rt_metrics`, `rt_basic`, `rt_handle_block_on`, `task_blocking`
Full-chain verification against the monorepo
With the monorepo's `[patch.crates-io]` temporarily pointed at this branch and starlark bumped to 0.14 (the consuming change this rebase exists for):
- `cargo update`: starlark 0.14 + pagable 0.3.3 + this tokio + blake3 1.8.2 unify
- `cargo check/test -p starbuf -p starbuf-cli` (the 0.14 consumer)
- `cargo check -p a3cmd --all-targets` (first stall-detection consumer)
- `cargo check -p antline-api` (large tokio+sqlx consumer)
What's needed beyond this PR (monorepo side, separate PR there)
- `[patch.crates-io] tokio` to `=1.52.7003` in the root workspace + the 3 separate workspaces (o11y_services/roots, o11y_services/leafcutter, ditto/cli-tools); `cargo update -p tokio`.
- `buildinfra/patches/tokio.patch` + cargo-bazel locks (the bazel side renders the fork delta as a patch on upstream).
- socket2, once_cell (tokio 1.52.3's own minimums).
Repo-settings changes needed (cannot be done in code)
- `publish` environment branch restriction must allow `anthropic-1.52.3` (currently restricted to `anthropic-1.49.0`).
- `anthropic-1.52.3` after merge (#sec-eng-assist ticket).
🤖 Generated with Claude Code
Maintainer actions needed (anthropics/tokio settings + push)
- `git fetch https://github.com/rpb-ant/tokio.git anthropic-1.52.3 && git push origin FETCH_HEAD:refs/heads/anthropic-1.52.3`
- `anthropic-1.52.3` (publish-environment branch restriction currently pins to `anthropic-1.49.0`)
- `anthropic-1.52.3` once published
- `1.52.7003+anthropic` lands in crates-internal