Summary
Bitswap taskWorker goroutines in boxo/bitswap/server spend up to DefaultNegotiationTimeout (10s) each on stream.Close() when the remote peer is slow or silent during the multistream lazy handshake. Each stalled Close holds a worker until its SetReadDeadline expires. On a busy node with even a small fraction of unresponsive peers, the TaskWorkerCount pool saturates and bitswap stops serving blocks to new peers.
We observed this on collab-cluster nodes running kubo v0.41.0-rc2 (boxo v0.38.0) while investigating #1141. The earlier fix in #1083 (boxo v0.36.0) set a read deadline to bound each individual Close, and that deadline works as intended. The remaining problem is structural: 10s per worker is still too long under the churn a larger storage provider realistically experiences.
Environment
| Component | Version |
| --- | --- |
| kubo | v0.41.0-rc2 |
| boxo | v0.38.0 |
| go-libp2p | v0.48.0 |
| go-multistream | v0.6.1 |
| Observed on | collab-cluster-rbx8.cluster.dwebops.net |
Call path
Every busy taskWorker goroutine on the affected host sat at the same frame:
```
boxo/bitswap/server.taskWorker (server.go:278)
boxo/bitswap/server.sendBlocks (server.go:335)
boxo/bitswap/network.router.SendMessage (router.go:104)
boxo/bitswap/network/bsnet.impl.SendMessage (ipfs_impl.go:401)
go-libp2p/p2p/host/basic.streamWrapper.Close (basic_host.go:691)
multistream.lazyClientConn.Close (lazyClient.go:196)
multistream.once.Do (lazyClient.go:53 / 58)
...doReadHandshake... (lazyClient.go:107)
...stream.Read blocked in yamux or quic-go...
```
lazyClientConn.Close() finishes the handshake it skipped during the write path by reading a varint token. If the peer is gone or silent, that read blocks until the SetReadDeadline set in bsnet.SendMessage expires.
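To make the blocking step concrete, here is a toy sketch of the lazy-negotiation pattern those stack frames describe. This is illustrative only, not the upstream go-multistream source: the one-byte handshake read and all names are invented, but the shape (writes go out immediately, the handshake read hides behind a sync.Once, and Close pays for it) matches the once.Do / doReadHandshake frames above.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync"
	"time"
)

// lazyConn defers the read half of a handshake until Read or Close,
// mirroring the shape of multistream's lazyClientConn (names invented).
type lazyConn struct {
	conn          net.Conn
	handshakeOnce sync.Once
	handshakeErr  error
}

// readHandshake consumes the peer's negotiation response. Against a
// silent peer this read blocks until a read deadline (if any) expires.
func (l *lazyConn) readHandshake() {
	buf := make([]byte, 1)
	_, l.handshakeErr = io.ReadFull(l.conn, buf)
}

func (l *lazyConn) Write(p []byte) (int, error) { return l.conn.Write(p) }

// Close is where the skipped handshake is finally paid for: this is the
// once.Do / doReadHandshake pair in the stack trace above.
func (l *lazyConn) Close() error {
	l.handshakeOnce.Do(l.readHandshake)
	return l.conn.Close()
}

func main() {
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	go func() {
		c, _ := ln.Accept() // silent peer: accept, never respond
		time.Sleep(time.Minute)
		c.Close()
	}()

	raw, _ := net.Dial("tcp", ln.Addr().String())
	lc := &lazyConn{conn: raw}
	_, _ = lc.Write([]byte("hello")) // returns immediately

	_ = raw.SetReadDeadline(time.Now().Add(2 * time.Second)) // bounds the hang

	start := time.Now()
	_ = lc.Close() // blocks in readHandshake until the deadline fires
	fmt.Println("Close took", time.Since(start))
}
```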
How widespread is this on a running node
On rbx8 at the moment of capture (kubo v0.41.0-rc2, 12-hour uptime, TaskWorkerCount=12):
- 8 of 12 workers sat in this frame.
- vole bitswap check against this node timed out in 3 of 3 trials (30s timeout).
- A snapshot on a v0.40.1 sibling node showed 5 of 12 workers in the same frame, with partial probe success.
Reproduction
A minimal reproducer (sketched below, after the results) uses net.Conn and go-multistream, with no libp2p involved:
- Silent server: TCP listener that accepts and holds the connection without responding.
- Client: NewMSSelect, then Write(payload), then an optional SetReadDeadline, then Close.
- Measure the elapsed time for Close.
Results:
- Without SetReadDeadline: Close hung indefinitely (tested up to a 10s cap).
- With SetReadDeadline(2s): Close returned in 2.000289693s. The deadline chain works.
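A sketch of that reproducer, assuming the NewMSSelect signature from recent go-multistream releases (error handling elided; the protocol string and timings are arbitrary):

```go
package main

import (
	"fmt"
	"net"
	"time"

	"github.com/multiformats/go-multistream"
)

func main() {
	// Silent server: accept the connection and never answer the handshake.
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	go func() {
		c, _ := ln.Accept()
		time.Sleep(time.Minute)
		c.Close()
	}()

	conn, _ := net.Dial("tcp", ln.Addr().String())

	// Lazy negotiation: the write path does not wait for the peer's ack.
	lazy := multistream.NewMSSelect(conn, "/repro/1.0.0")
	_, _ = lazy.Write([]byte("payload")) // succeeds immediately

	// Comment out this line to reproduce the unbounded hang.
	_ = conn.SetReadDeadline(time.Now().Add(2 * time.Second))

	start := time.Now()
	_ = lazy.Close() // performs the deferred handshake read
	fmt.Printf("Close returned after %v\n", time.Since(start))
}
```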
Quantifying the pool-saturation effect
A 12-worker pool processes 200 tasks, 10 percent of which have a silent peer, with a 10s close deadline (same as production):
| Variant | Completed | Time | Tasks/s |
| --- | --- | --- | --- |
| Current path: worker waits for Close | 194 / 200 | 20s (budget capped) | 9.7 |
| Decoupled: Close runs in a goroutine, worker returns after Write | 200 / 200 | 5ms | ~40,000 |
At 10 percent unresponsive peers, the synchronous close cuts effective throughput by a factor of about 4000. This matches the field observation: engine -> msg fires for thousands of other peers in the same window in which a new peer sees nothing.
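A scaled-down, illustrative version of that benchmark (not the actual test code): worker and task counts match the table, and a 100ms close deadline stands in for the production 10s so the run finishes quickly.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// run processes 200 tasks on a 12-worker pool; every tenth task has a
// "silent peer" whose close blocks until the deadline expires.
func run(decoupled bool) time.Duration {
	const workers, tasks = 12, 200
	closeDeadline := 100 * time.Millisecond // stand-in for the 10s production deadline

	jobs := make(chan int)
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				silent := i%10 == 0
				closeStream := func() {
					if silent { // blocked handshake read until the deadline
						time.Sleep(closeDeadline)
					}
				}
				if decoupled {
					go closeStream() // worker freed as soon as the write is done
				} else {
					closeStream() // worker held for the whole deadline
				}
			}
		}()
	}
	for i := 0; i < tasks; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return time.Since(start)
}

func main() {
	fmt.Println("synchronous close:", run(false))
	fmt.Println("decoupled close:  ", run(true))
}
```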
Why it makes sense to fix this in boxo
Two upstream locations could bound Close directly: go-multistream (in lazyClientConn.Close) and go-libp2p (in streamWrapper.Close). Either would help every consumer.
Some reasons to prefer a boxo-local fix:
- go-multistream is used by implementations beyond libp2p. Changing Close semantics, even to "bounded with a timeout", could surprise older code that depends on the blocking guarantee for correctness (the QUIC STOP_SENDING edge case the current code comment describes).
- The future of go-libp2p maintenance is uncertain, so an upstream round trip for behaviour that mainly matters to bitswap may take a while to land.
- Bitswap has a clear opinion: once the message bytes are written, the multistream ack is not needed to consider the send done. This is a bitswap-specific choice, and boxo is a natural place for it.
A boxo-local fix keeps the blast radius small, ships on boxo's release cadence, and avoids renegotiating Close semantics with every other consumer of go-multistream.
Proposed fix
In bsnet.SendMessage, after msgToStream succeeds, move Close to a goroutine and return. The worker is freed as soon as the bytes are written.
```go
// Sketch; not final.
if err := bsnet.msgToStream(ctx, s, outgoing, timeout); err != nil {
	_ = s.Reset()
	return err
}
// Decouple the close: the worker returns as soon as the bytes are written,
// while the bounded handshake read finishes in the background.
go func(s network.Stream, deadline time.Time) {
	_ = s.SetReadDeadline(deadline)
	_ = s.Close()
}(s, time.Now().Add(timeout))
return nil
```
Trade-offs:
- The caller no longer sees errors from Close. Current callers in boxo log them and move on, and the bitswap message itself is already written to the stream before Close is reached.
- Each send spawns one goroutine. Goroutines are cheap, and the in-flight total is already bounded by the deadline: in steady state the close-goroutine count is about send_rate * deadline, for example roughly 1,000 goroutines at 100 sends/s with a 10s deadline. That is the same work the task-worker pool holds today; the change only moves the holding away from the workers.
- streamMessageSender (the persistent per-peer sender) uses a different code path, so it should not be affected; I will double-check that in the PR.
Related
- stream.Close() blocks indefinitely on unresponsive peers #1083: added SetReadDeadline before Close in bitswap/SendMessage (boxo v0.36.0). That fix bounds each individual hang and does exactly what it set out to do. This issue addresses the remaining worker occupancy during the bounded hang, which the earlier fix could not tackle on its own.
- DefaultPeerComparator on a busy server #1141: the issue where this was traced. The bitswap silence to new peers reported there is the downstream symptom of this pool saturation.
Artifacts
Two goroutines.stacks files from collab-cluster production (kubo v0.40.1 and v0.41.0-rc2), the full ipfs diag profile bundles, and the two reproducer programs are available on the boxes.
Next steps
I will open a PR with the proposed fix, plus a regression test that mirrors the pool-saturation benchmark.