Skip to content

session: fix EnqueueInitialData reversing byte order on multi-call writers#147

Merged
Kianmhz merged 1 commit into
mainfrom
fix/enqueue-initial-data-ordering
May 21, 2026
Merged

session: fix EnqueueInitialData reversing byte order on multi-call writers#147
Kianmhz merged 1 commit into
mainfrom
fix/enqueue-initial-data-ordering

Conversation

@Kianmhz
Copy link
Copy Markdown
Owner

@Kianmhz Kianmhz commented May 21, 2026

Summary

`EnqueueInitialData` was implemented to prepend to the tx buffer while `synNeeded` was true. The assumption was "called once, before the SYN drains." But the SOCKS5 adapter calls it on every `Write`, and a fast local writer can fit many calls before a poll worker drains the SYN. Each prepend pushed new bytes in front of the existing buffer, so the on-wire byte order ended up reversed.

For any protocol whose leading bytes carry framing — a TLS record header, an HTTP request line, a length prefix — the upstream parses garbage or rejects the stream. The bug has existed since the connect_data optimization landed in v1.4.

Specific consequence: every upload-throughput bench result has been wrong

The bench's `sized` sink (`bench/sink/main.go`) reads the first 8 bytes as a size header. With prepend, those 8 bytes came from the last chunk written before the SYN drained — usually a body chunk full of zeros, decoded as `size=0`. The sink ACKed instantly without reading the body, and the harness measured the time-to-instant-ACK as upload throughput.

That made v1.5 look like 22 MB/s when actual upload is ~5 MB/s. The whole "v1.5→v1.6 upload regression" narrative behind #143 was a bench artifact. With the prepend fixed and `workersPerEndpoint` tested at 3, 4, and 5:

workersPerEndpoint 8 MB single-session upload 8 MB 4-session upload sessions_per_sec
3 (current main) 4.49 MB/s ~17 MB/s 4.5/s
4 (pre-#143) 4.49 MB/s ~17 MB/s 50/s
5 4.49 MB/s 17.8 MB/s 90/s

The per-session ceiling is maxDrainFramesPerSession × MaxFramePayload / ActiveDrainWindow ≈ 5.85 MB/s. Worker count doesn't change it. 4 concurrent sessions get 4 × the per-session number, as you'd expect.

What does NOT change

  • The carrier protocol is unchanged. This is purely a local-buffer correctness fix.
  • Existing single-write callers (the common case for new connections) are unaffected: with one Write before drain, append vs prepend gives the same result.
  • The connect_data optimization still works — data still bundles into the SYN frame.

What DOES change for real users

  • Any application that writes a length-prefixed or framed payload immediately after connecting was hitting silent corruption on the first message. TLS handshakes were OK because ClientHello is usually one Write that fits in the SYN frame. HTTP/1.1 was OK because the first Write usually fits the request line + headers. But anything that issued the request line and the body as separate Writes within microseconds of `Dial()` could see the request body arrive before the request line on the upstream.
  • This may have caused obscure first-request failures users couldn't reproduce.

What's next (separate work)

This fix means #143's premise (v1.5→v1.6 upload regression caused by workersPerEndpoint=4) was wrong — the upload numbers were always bench artifacts. A follow-up should:

  1. Revert #143 (back to workersPerEndpoint=4) — no upload regression to worry about, get back the v1.6 session-open speedup.
  2. Reconsider the spike in #146 on its real merits (unlabeled-multi-endpoint configs went from v1.5's 15 workers to v1.6's 4 — that part of the regression is real).
  3. To genuinely improve per-session upload throughput beyond ~6 MB/s, the per-session in-flight lock and the 350 ms ActiveDrainWindow need design work — out of scope here.

Verification

  • go test -count=1 -timeout 90s ./... — all green; new test catches the bug.
  • ./bench/bench.sh --baseline v1.6.0 --scenario throughput_up_8MB_1session — 3 consecutive runs all report 4.49 MB/s (vs the previous lying 22.3 MB/s).

🤖 Generated with Claude Code

…iters

EnqueueInitialData was implemented to prepend rather than append, on the
assumption it would be called once before the SYN drains. But the SOCKS5
adapter calls it on every Write, and a fast local writer (the bench harness,
or any application that issues several small Writes immediately after
connecting) can fit many calls before a poll worker drains the SYN. Each
prepend pushed the new data in front of what was already buffered, so the
on-wire byte order ended up reversed.

For any protocol whose leading bytes carry framing — a TLS record header, an
HTTP request line, a length prefix — the upstream parses garbage or rejects
the stream. The bug has existed since the connect_data optimization landed
in v1.4.

A specific consequence worth flagging: every upload-throughput bench result
ever recorded for this project has been wrong. The bench's sized-sink reads
the first 8 bytes as a size header. With prepend, those 8 bytes came from
the LAST chunk written before the SYN drained — usually a body chunk full
of zeros, decoded as size=0. The sink ACKed instantly without reading the
body, and the harness measured the time-to-instant-ACK as upload throughput.
That made v1.5 look like 22 MB/s when actual upload was ~5 MB/s. The
"v1.5→v1.6 upload regression" narrative behind #143 was a bench artifact;
the real per-session ceiling is `maxDrainFramesPerSession × MaxFramePayload /
ActiveDrainWindow` ≈ 5.85 MB/s in both versions.

The fix is one-line (append instead of prepend); the comment is the
expensive part. Test added to catch the regression directly.
@Kianmhz Kianmhz merged commit 6340afb into main May 21, 2026
6 checks passed
@Kianmhz Kianmhz deleted the fix/enqueue-initial-data-ordering branch May 21, 2026 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant