Rebase fork onto upstream tokio 1.52.3 (1.52.7003+anthropic)#1

Draft
rpb-ant wants to merge 13 commits into upstream-tokio-1.52.3 from anthropic-1.52.3

rpb-ant (Owner) commented Jun 1, 2026

Note

Why this PR is in rpb-ant/tokio rather than anthropics/tokio: the fork repo is push-restricted (rpb has pull-only access), so the review surface lives here. The base branch upstream-tokio-1.52.3 is the pristine upstream tag, so the diff below is exactly the fork's 13-commit delta — nothing else. Once blessed, a maintainer pushes anthropic-1.52.3 to anthropics/tokio (commands in the "Maintainer actions" section at the bottom).

Rebase fork onto upstream tokio 1.52.3 → 1.52.7003+anthropic

Why now

The anthropic monorepo needs starlark 0.14 (Meta's Buck2 Starlark interpreter, for the
starbuf typed-config project, PR #539931). starlark 0.14 has a mandatory dependency on Meta's
pagable persistence crate, which requires tokio ^1.52.1 — a requirement our fork, based
on upstream 1.49.0, cannot satisfy. Since [patch.crates-io] makes the fork the only tokio
in the workspace, the whole monorepo is capped at 1.49 until the fork moves.

(The ^1.52.1 pin is itself an artifact of Meta's release tooling — pagable only uses
tokio-1.0-era APIs — but their pins will always track their current tokio, so keeping the
fork reasonably fresh is the durable fix. Recon with the full analysis:
anthropic monorepo → ~rpb/foo/starbuf-antline/notes/2026-06-01-tokio-fork-recon.md.)

This also picks up ~5 months of upstream fixes (mpsc bugs, a spawn_blocking hang fix, RwLock
fixes, io-uring work).

What this branch is

anthropic-1.52.3, cut from upstream tag tokio-1.52.3, carrying the identical fork delta
as anthropic-1.49.0:

  • 13 commits: the 12 substantive fork commits cherry-picked with original authorship
    (njs ×5, sujay ×2, edwin ×5; the two empty "ci: trigger publish retry" commits were dropped),
    plus one release commit (version + publish-trigger/doc updates).
  • Delta vs upstream: 23 files, +2,149/−47 (vs +2,150/−47 on the 1.49 branch — the one-line
    difference is a merged TOML section header, see conflicts below).
  • Final version: 1.52.7003+anthropic (N=7 continues the monotonic counter per the
    patch-offset scheme; upstream_patch=3).
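
The version arithmetic behind 1.52.7003 can be sketched as below. This is an illustration of the P = N * 1000 + upstream_patch rule only, not code from the fork or its release tooling: the names `ForkPatch`, `encode_patch`, `decode_patch`, and `fork_version` are hypothetical, and the guards (N starts at 1, upstream patch stays below 1000) are assumptions about what keeps the encoding unambiguous.

```rust
/// The two components packed into a fork patch number (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct ForkPatch {
    /// Monotonic fork release counter (N).
    pub release: u64,
    /// Patch number of the upstream release the fork is based on.
    pub upstream_patch: u64,
}

/// Encode P = N * 1000 + upstream_patch.
/// Assumption: N >= 1 and upstream_patch < 1000, so a fork patch number
/// can never equal an upstream patch number and decoding is unambiguous.
pub fn encode_patch(release: u64, upstream_patch: u64) -> Option<u64> {
    if release == 0 || upstream_patch >= 1000 {
        return None;
    }
    release.checked_mul(1000)?.checked_add(upstream_patch)
}

/// Recover N and the upstream patch from an encoded patch number.
pub fn decode_patch(patch: u64) -> ForkPatch {
    ForkPatch {
        release: patch / 1000,
        upstream_patch: patch % 1000,
    }
}

/// Render the full version string with the constant build-metadata marker.
pub fn fork_version(major: u64, minor: u64, release: u64, upstream_patch: u64) -> Option<String> {
    let patch = encode_patch(release, upstream_patch)?;
    Some(format!("{}.{}.{}+anthropic", major, minor, patch))
}
```

With N=7 and upstream_patch=3 on the 1.52 line, `fork_version(1, 52, 7, 3)` yields the version string used by this PR.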

Conflicts hit during the rebase (4 total, all trivial)

| Where | What | Resolution |
| --- | --- | --- |
| examples/Cargo.toml | upstream added the prewarm-fd-table example where the fork adds its stall_detection example | kept both [[example]] entries |
| examples/Cargo.toml (semantic, found by build) | upstream 1.52.3 now has its own [target.'cfg(target_os = "linux")'.dev-dependencies] section; the fork's commit also added one → duplicate TOML key | merged the fork's tokio = {..., "stall-detection"} dev-dep into upstream's existing section |
| tokio/src/runtime/builder.rs ×2 | upstream added enable_eager_driver_handoff field/initializer at the same spot as the fork's stall_detection_config | kept both fields / both initializers |
| tokio/Cargo.toml (×7 commits) | version-line context drift (each fork commit bumps the version) | translated to the new base (1.49.N000 → 1.52.N003) |

Why the risk is low

Two upstream changes in the 1.49→1.52 window touched the fork's hook points, and both were
reverted upstream before 1.52.3: LIFO-slot stealing (added in 1.51.0, tokio-rs#7431; reverted
in 1.52.2, tokio-rs#8100) and the sharded spawn_blocking queue (added in 1.52.0, tokio-rs#7757;
reverted in 1.52.1, tokio-rs#8057). The scheduler structure the stall-detection feature
instruments is therefore unchanged at 1.52.3. The files the fork extends most
(metrics/worker.rs, metrics/runtime.rs, blocking/mod.rs, util/rand.rs) are byte-identical
between 1.49.0 and 1.52.3; multi_thread/worker.rs auto-merged.

Test results (this branch, locally)

| Suite | Result |
| --- | --- |
| cargo build --features full,test-util,tracing,stall-detection | |
| tests/rt_stall_detection.rs (the fork's own feature) | ✅ 15/15 |
| rt_threaded + rt_common (scheduler/conflict areas) | ✅ 139/139 + 28/28 |
| rt_metrics, rt_basic, rt_handle_block_on, task_blocking | ✅ 76 passed, 21 ignored (pre-existing ignores) |
| doc tests | ✅ 748/748 |
| loom tests | not run locally (CI runs them) |

Full-chain verification against the monorepo

With the monorepo's [patch.crates-io] temporarily pointed at this branch and starlark bumped
to 0.14 (the consuming change this rebase exists for):

| Check | Result |
| --- | --- |
| cargo update: starlark 0.14 + pagable 0.3.3 + this tokio + blake3 1.8.2 unify | ✅ first time ever |
| cargo check/test -p starbuf -p starbuf-cli (the 0.14 consumer) | ✅ 42/42 tests |
| cargo check -p a3cmd --all-targets (first stall-detection consumer) | |
| cargo check -p antline-api (large tokio+sqlx consumer) | |

What's needed beyond this PR (monorepo side, separate PR there)

  1. Bump [patch.crates-io] tokio to =1.52.7003 in the root workspace + the 3 separate
    workspaces (o11y_services/roots, o11y_services/leafcutter, ditto/cli-tools); cargo update -p tokio.
  2. Regenerate buildinfra/patches/tokio.patch + cargo-bazel locks (the bazel side renders the
    fork delta as a patch on upstream).
  3. Forced transitive moves that ride along: mio 1.0.4→1.2.1, tokio-macros 2.6.0→2.7.0,
    socket2, once_cell (tokio 1.52.3's own minimums).
  4. Full monorepo CI + the usual staged rollout watchfulness (tokio underpins every prod Rust service).

Repo-settings changes needed (cannot be done in code)

  • The GitHub publish environment branch restriction must allow anthropic-1.52.3
    (currently restricted to anthropic-1.49.0).
  • Per go/fork: flip the repo default branch to anthropic-1.52.3 after merge
    (#sec-eng-assist ticket).

🤖 Generated with Claude Code

Maintainer actions needed (anthropics/tokio settings + push)

  1. Push the blessed branch into the fork repo: git fetch https://github.com/rpb-ant/tokio.git anthropic-1.52.3 && git push origin FETCH_HEAD:refs/heads/anthropic-1.52.3
  2. Allow the publish workflow to run from anthropic-1.52.3 (publish-environment branch restriction currently pins to anthropic-1.49.0)
  3. Optionally flip the repo default branch to anthropic-1.52.3 once published
  4. Trigger the publish → 1.52.7003+anthropic lands in crates-internal

🤖 Generated with Claude Code

njsmith and others added 13 commits June 1, 2026 17:22
Add a `stall-detection` feature that detects when a tokio worker thread
is stalled (blocked in a task poll for too long) and reports diagnostics
including stack traces.

How it works:

- Each worker maintains a generation counter (local increment + Release
  store to a per-worker AtomicU64 in WorkerMetrics). Odd = currently
  polling a task, even = idle.
- A background monitor thread polls these counters periodically. If a
  counter has the same odd value on successive polls, that worker is
  stalled.
- On detection, captures a user-space stack trace by sending a realtime
  signal to the stalled thread. The signal handler uses a fully async-
  signal-safe frame-pointer walker (no locks, no heap allocation, no
  dl_iterate_phdr). Also captures the kernel stack via procfs.
- Reports stall events via tracing (warn for resolved, error for
  ongoing). Optional on_stall callback for programmatic access.
- Signal number is chosen dynamically by probing for free realtime
  signals, avoiding conflicts with other signal-based tools.
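
The counter protocol in the list above can be sketched with std atomics alone. This is a minimal illustration and not the fork's implementation: `WorkerGeneration`, `StallMonitor`, and `scan` are hypothetical names, the monitor is driven by explicit calls rather than a background thread, and the signal-based stack capture is left out entirely.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Worker-side handle: keeps a local copy of the generation and publishes it
/// with a Release store. Odd = polling a task, even = idle.
pub struct WorkerGeneration<'a> {
    local: u64,
    shared: &'a AtomicU64,
}

impl<'a> WorkerGeneration<'a> {
    pub fn new(shared: &'a AtomicU64) -> Self {
        Self { local: 0, shared }
    }

    fn bump(&mut self) {
        self.local = self.local.wrapping_add(1);
        self.shared.store(self.local, Ordering::Release);
    }

    /// Call right before polling a task (even -> odd).
    pub fn poll_start(&mut self) {
        debug_assert!(self.local % 2 == 0);
        self.bump();
    }

    /// Call right after the poll returns (odd -> even).
    pub fn poll_end(&mut self) {
        debug_assert!(self.local % 2 == 1);
        self.bump();
    }
}

/// Monitor-side state: the value seen for each worker on the previous scan.
pub struct StallMonitor {
    last_seen: Vec<u64>,
}

impl StallMonitor {
    pub fn new(workers: usize) -> Self {
        Self { last_seen: vec![0; workers] }
    }

    /// Returns the indices of workers whose counter holds the same odd value
    /// as on the previous scan, i.e. the same poll is still in progress.
    pub fn scan(&mut self, counters: &[AtomicU64]) -> Vec<usize> {
        let mut stalled = Vec::new();
        for (idx, (counter, last)) in counters.iter().zip(self.last_seen.iter_mut()).enumerate() {
            let now = counter.load(Ordering::Acquire);
            if now % 2 == 1 && now == *last {
                stalled.push(idx);
            }
            *last = now;
        }
        stalled
    }
}
```

A worker that finishes one poll and starts another between scans publishes a different odd value, so only a poll that spans two consecutive scans is reported.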

Usage:

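    use std::time::Duration;

    // The stall_detection builder methods come from the fork's `stall-detection` feature.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_stall_detection()
        .stall_detection_poll_interval(Duration::from_millis(100))
        .stall_detection_escalation_threshold(Duration::from_secs(10))
        .build()
        .unwrap();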

Performance overhead:

- Generation counter: ~2ns per task poll on x86 (two Release stores
  to an uncontended, cache-line-aligned atomic). ~8ns on ARM.
- Monitor thread: one std::thread doing periodic reads of N atomics.
- Signal-based trace capture: only fires when a stall is detected.
- Zero overhead when feature is not enabled.

The GitHub repo has an existing 'publish' environment. Update the workflow
to reference it instead of 'publish-cli'. The environment name must match
the OIDC subject claim in the Terraform config.

Moves `Builder::rng_seed`, the `runtime::RngSeed` re-export, and
`RngSeed::from_bytes` out of `cfg_unstable!` so callers can seed the
scheduler RNG (and thus get deterministic `select!` branch ordering)
without `--cfg tokio_unstable`.

The underlying `seed_generator` field and all runtime plumbing were
already unconditional; only the public setter and type re-export were
gated. No behavior change for builds that already set `tokio_unstable`.

Upstream tracking issue: tokio-rs#4879 (introduced in tokio-rs#4910).

Cargo ignores build metadata for version identity, so 1.49.0+anthropic.2
collides with the already-published 1.49.0+anthropic.1 and the publish is
rejected. Move the release counter into the patch number using
P = N * 1000 + upstream_patch so each fork release is a distinct semver
version that never overlaps an upstream patch number. Keep +anthropic as
a constant build-metadata marker.

:house: Remote-Dev: homespace

Add `symbolicated_frames: Vec<String>` to `StallInfo` so a callback can
forward the resolved frames (e.g. to Sentry) without re-running
`backtrace::resolve` on the same IPs the monitor thread already
symbolicated for the log line.

The monitor was already symbolicating inside `format_trace`. Pull that
call up into `emit_resolved`/`emit_escalation`, pass the slice to
`format_trace`, and move the vec into the `StallInfo` handed to the
callback. Log output is unchanged. A non-Linux stub of
`symbolicate_trace` keeps the call site portable.
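
The shape of that refactor (symbolicate once, format from a borrowed slice, then move the vec into the callback payload) can be sketched as follows. The function and field names follow the commit message, but the bodies and signatures here are assumptions made for illustration: the `StallInfo` below is a trimmed-down stand-in, `symbolicate_trace` is a placeholder that only prints addresses, and the log line is returned instead of being emitted through tracing.

```rust
/// Trimmed-down, hypothetical stand-in for the fork's `StallInfo`.
pub struct StallInfo {
    pub worker: usize,
    pub symbolicated_frames: Vec<String>,
}

/// Placeholder resolver: real code would map instruction pointers to symbols.
fn symbolicate_trace(ips: &[usize]) -> Vec<String> {
    ips.iter().map(|ip| format!("{:#x}", ip)).collect()
}

/// Formats already-resolved frames; it borrows the slice and never re-resolves.
fn format_trace(frames: &[String]) -> String {
    frames
        .iter()
        .enumerate()
        .map(|(i, frame)| format!("  #{} {}", i, frame))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Resolves the frames once, builds the log line from a borrow, then moves
/// the same vec into the `StallInfo` handed to the optional callback.
pub fn emit_resolved(
    worker: usize,
    ips: &[usize],
    on_stall: Option<&dyn Fn(StallInfo)>,
) -> String {
    let frames = symbolicate_trace(ips);
    let log_line = format!("worker {} stalled\n{}", worker, format_trace(&frames));
    if let Some(callback) = on_stall {
        callback(StallInfo { worker, symbolicated_frames: frames });
    }
    log_line
}
```

Because the formatter only borrows the frames, the move into `StallInfo` costs no extra allocation and the log text does not depend on whether a callback is installed.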
A process running multiple tokio runtimes (each with its own
`Builder::thread_name(...)` prefix) currently emits stall events that
downstream can't tell apart -- the `worker` index is reused across
runtimes. Surface the configured thread name so log/Sentry/metrics
consumers can group stalls by runtime via the thread-name prefix.

Capture each worker's `std::thread::current().name()` at startup into a
new `WorkerMetrics::thread_name: OnceLock<String>` and expose it via
`RuntimeMetrics::worker_thread_name(idx)`. The stall monitor reads the
name when it first detects a stall, stores it alongside the stack
trace and blocking-pool snapshot, and includes it on both:

- the `tracing::warn!`/`error!` event as a `thread_name` structured
  field (the message templates are unchanged so issue-grouping by
  template stays stable),
- the `StallInfo::thread_name` handed to the `on_stall` callback.

This is the full untruncated thread name -- not the kernel's 15-byte
`comm` -- so a configured prefix survives intact and works on non-Linux
too.
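
The capture-once pattern described above can be sketched with std only. This is a simplified illustration under stated assumptions, not the fork's code: `WorkerMetrics` here has a single field, the accessor lives on it directly rather than on `RuntimeMetrics`, and `capture_thread_name` and `spawn_named_worker` are hypothetical helpers.

```rust
use std::sync::{Arc, OnceLock};
use std::thread;

/// Simplified stand-in for the per-worker metrics slot.
#[derive(Default)]
pub struct WorkerMetrics {
    pub thread_name: OnceLock<String>,
}

impl WorkerMetrics {
    /// Record the calling thread's name. A OnceLock can be set only once,
    /// so repeated calls from other startup sites are harmless no-ops.
    pub fn capture_thread_name(&self) {
        if let Some(name) = thread::current().name() {
            let _ = self.thread_name.set(name.to_owned());
        }
    }

    /// Full, untruncated name, or None if the worker never recorded one.
    pub fn worker_thread_name(&self) -> Option<&str> {
        self.thread_name.get().map(String::as_str)
    }
}

/// Hypothetical helper: start a named thread that records its own name at
/// startup, then hand the metrics back once the thread has finished.
pub fn spawn_named_worker(name: &str) -> Arc<WorkerMetrics> {
    let metrics = Arc::new(WorkerMetrics::default());
    let for_worker = Arc::clone(&metrics);
    thread::Builder::new()
        .name(name.to_owned())
        .spawn(move || for_worker.capture_thread_name())
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked");
    metrics
}
```

The name comes from the Rust thread handle rather than from the kernel, which is why a prefix longer than 15 bytes is preserved.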
…tup path

The thread-name capture added in anthropics#14 was wired into two of the three
per-worker startup sites that record os_thread_id, but missed the
primary one: `fn run(worker)` in multi_thread/worker.rs, which every
multi-thread worker passes through exactly once when its thread starts.
The other multi-thread site only fires on a core-reacquire handoff
(e.g. after `block_in_place`), which a worker that stalls on its first
task never hits.

`WorkerMetrics::thread_name` is a `OnceLock`, so a worker that reaches
its first stall without `fn run()` having set it reports
`thread_name = None` -- exactly the case the field was added to cover.

Add the same `thread::current().name()` -> `metrics.thread_name.set()`
block to `fn run()`. Also add a test that installs an `on_stall`
callback, configures a `Builder::thread_name`, triggers a stall, and
asserts the callback receives the configured name -- this fails without
the fix.

The fork delta (stall-detection feature, Builder::rng_seed stabilization,
publish workflow) is unchanged: 23 files, +2150/-47, identical to the
delta carried on anthropic-1.49.0. Upstream 1.49.0 -> 1.52.3 brings ~5
months of fixes; the two scheduler changes that touched our hook points
(LIFO-slot stealing, sharded spawn_blocking queue) were both reverted
upstream before 1.52.3, so the runtime structure the fork instruments is
unchanged.

Publish trigger and docs now reference the anthropic-1.52.3 branch. The
GitHub 'publish' environment branch restriction must be updated in repo
settings to match (cannot be done in code).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>