
upstream changes #4

Open
mikehash wants to merge 990 commits into qtumproject:bump-qtum-22.1 from trezor:master

Conversation

@mikehash

No description provided.

cranycrane and others added 30 commits March 25, 2026 15:56
pragmaxim and others added 30 commits May 27, 2026 13:16
#1527)

* feat(ethrpc): observe JSON-RPC calls for block sync and emit error requests

* refactor(ethrpc): naming of the observation method
clear() does not shrink a Go map's underlying bucket allocation, so a
single very large block would leave hits at its high-water mark even
after reset. Reinitialize with the maxPendingHits pre-size instead so
the runtime can release the oversized allocation.
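
A minimal sketch of the reset described above, assuming a map keyed by string and a pre-size of 1024. The `hitTracker` type, the key type, and the value of `maxPendingHits` are illustrative assumptions, not taken from the commit.

```go
package main

// maxPendingHits is a hypothetical pre-size; the real value is not shown in the commit.
const maxPendingHits = 1024

// hitTracker is an illustrative stand-in for whatever structure owns the hits map.
type hitTracker struct {
	hits map[string]int
}

func newHitTracker() *hitTracker {
	return &hitTracker{hits: make(map[string]int, maxPendingHits)}
}

// record counts one hit for key.
func (t *hitTracker) record(key string) { t.hits[key]++ }

// pending reports how many distinct keys are currently tracked.
func (t *hitTracker) pending() int { return len(t.hits) }

// reset replaces the map instead of calling clear(t.hits). clear removes the
// entries but keeps the bucket array at its largest size; dropping the old map
// lets the garbage collector reclaim it.
func (t *hitTracker) reset() {
	t.hits = make(map[string]int, maxPendingHits)
}
```

The tradeoff is one fresh allocation per reset in exchange for a bounded steady-state footprint after an unusually large block.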
…borts

Sync could block forever while queueing block hashes if all workers were stuck retrying "block not found", because the coordinator stopped reading worker aborts.
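
One way to avoid this kind of deadlock is to select on the abort channel in the same statement as the send. This is a sketch only: `queueHashes`, the channel types, and `errAborted` are hypothetical names, not the project's actual code.

```go
package main

import "errors"

// errAborted is a hypothetical error a worker reports when it gives up.
var errAborted = errors.New("sync aborted by worker")

// queueHashes hands each hash to the workers while still listening for
// aborts. A plain `work <- h` would block forever once every worker is stuck
// and nobody receives; the select lets an abort win instead.
func queueHashes(hashes []string, work chan<- string, abort <-chan error) error {
	for _, h := range hashes {
		select {
		case work <- h:
			// A worker (or the channel buffer) accepted the hash.
		case err := <-abort:
			// A worker aborted while we were waiting to queue.
			return err
		}
	}
	return nil
}
```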
Stop EVM sync from silently stalling when a backend's tip feed dies. Add subscription watchdogs (Tron poll-only) plus metrics.
  Arm lastSubNotifyNs at subscribe time, not only on the first tip advance. Liveness is otherwise stamped only when a header advances the tip, so a subscription that never delivers a usable header leaves it at 0 and keeps tipWatchdog's lastNs == 0 gate closed forever: the cached tip never refreshes and resyncIndex reports a silent syncNotNeeded. Seeding here lets a stalled feed age past the threshold so the watchdog polls and reconnects.
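
The effect of the `lastNs == 0` gate can be illustrated with a small liveness clock. This is a sketch under assumptions: the `feedLiveness` type and its method names are made up for illustration, timestamps are passed in as nanoseconds to keep it deterministic, and `atomic.Int64` needs Go 1.19 or later.

```go
package main

import "sync/atomic"

// feedLiveness is an illustrative stand-in for the subscription liveness clock.
type feedLiveness struct {
	lastSubNotifyNs atomic.Int64
}

// arm stamps the clock. Calling it at subscribe time, and not only when a
// header advances the tip, is what lets a feed that never delivers a usable
// header eventually age past the threshold.
func (f *feedLiveness) arm(nowNs int64) { f.lastSubNotifyNs.Store(nowNs) }

// stale reports whether the feed has been silent for longer than thresholdNs.
// A clock that was never armed (0) is never stale, so without the
// subscribe-time stamp the watchdog would stay quiet forever.
func (f *feedLiveness) stale(nowNs, thresholdNs int64) bool {
	last := f.lastSubNotifyNs.Load()
	if last == 0 {
		return false
	}
	return nowNs-last > thresholdNs
}
```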
The monotonic tip guard rejects every newHeads header below the cached tip.
That correctly rides out a transient lagging load-balancer node, but on a
genuine backend rollback to a lower height the feed delivers only sub-tip
heads, all rejected, so the cached tip freezes above the backend. The frozen
tip then equals the local DB tip, so resyncIndex early-exits as syncNotNeeded
and never reaches its GetBlockHash(localBestHeight) fork path — a silent stall
with no error and no metric, the same class of bug the watchdog set out to kill.

The watchdog already fires here: rejected headers don't refresh liveness, so
lastSubNotifyNs ages past TipStaleThreshold. Let its post-stall poll regress
the cached tip — refreshBestHeaderFromChain gains an allowRegress flag (hot
path stays monotonic, the watchdog passes true). After the full stall window a
still-lower backend tip is a real rollback, so the tip follows the backend
down and the next resyncIndex reaches its fork/disconnect path. A fluke lower
poll cannot corrupt state: resyncIndex re-confirms via an independent
GetBlockHash, and a chain still at the higher tip simply re-advances on the
next header. Emit watchdog_tip_rollback to distinguish it from a forward advance.
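
The guard with its escape hatch might look roughly like the following. The `tipCache` type and `update` method are hypothetical; only the `allowRegress` flag name comes from the commit message.

```go
package main

import "sync"

// tipCache is an illustrative stand-in for the cached best header.
type tipCache struct {
	mu     sync.Mutex
	height uint64
}

// update applies a newly observed backend height and reports whether the
// cached tip changed. The hot path passes allowRegress=false, so a lagging
// node cannot pull the tip down; the watchdog passes true after a full stall
// window, so a real rollback is followed.
func (c *tipCache) update(height uint64, allowRegress bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if height == c.height {
		return false
	}
	if height < c.height && !allowRegress {
		return false // monotonic guard: reject sub-tip heights
	}
	c.height = height
	return true
}

// get returns the cached tip height.
func (c *tipCache) get() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.height
}
```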

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
EthereumRPC.bestHeaderTime was the timestamp the old getBestHeader used to
trigger its 15-minute passive reconnect. That reconnect moved to tipWatchdog
(keyed on lastSubNotifyNs), removing the field's only reader and leaving it
written in two places but never read. Drop it so the struct no longer implies
a tip-staleness clock that nothing consults. Tron keeps its own, separate
bestHeaderTime (it cannot reach this unexported field) and is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tipWatchdogTick re-armed liveness (markNotifyAlive) after its own tip poll.
On a permanently silent ZeroMQ feed while the chain keeps producing, each poll
advanced the tip and reset the clock, so blockbook_backend_subscription_age_seconds
sawtoothed up to the threshold and back instead of climbing past it — the dead
feed stayed invisible to any age-based alert, the very failure mode the watchdog
exists to surface.

Drop the re-arm so liveness is refreshed only by a real ZeroMQ delivery
(newBlockNotifier), matching the EVM watchdog's invariant that its own poll never
counts as feed health. Polling every sample interval while the feed is silent is
the intended recovery, not a cost: Tron's seconds-apart blocks mean reaching the
threshold is already an abnormal gap, and the poll keeps sync moving until ZeroMQ
delivery resumes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tron's setBestHeader accepts a lower height where EthereumRPC's monotonic guard
rejects one. This is correct for Tron — its tip is always an HTTP re-query, not a
feed header, so following the backend down is what surfaces a rollback and avoids
the frozen-tip masking the EVM guard otherwise introduces. The flip side is that a
load-balanced Tron RPC with a lagging node could regress the tip and trip a spurious
fork (the case the EVM guard prevents), which is fine for the usual single-node
java-tron backend.

Document the choice and its boundary on setBestHeader so the asymmetry with EVM is
deliberate and visible, and point at the EVM pattern to port if Tron is ever fronted
by a load balancer. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fork walk-down treats an ErrBlockNotFound from the backend (remote == "") as
"this height is forked, disconnect it". That is required to heal a real lower-height
rollback, where the orphaned blocks genuinely no longer exist on the backend. The
cost is that a load-balanced backend whose lagging node answers NotFound for a still-
canonical block can over-disconnect; it is bounded and self-healing because the
following resyncIndex re-connects those blocks. The naive alternative — stopping on
NotFound — would be worse, leaving genuinely orphaned blocks connected and the index
wedged ahead of the backend after a real rollback.

Expand the existing one-line note into this rationale so the behavior reads as a
deliberate tradeoff rather than an oversight. No behavior change.
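
As a rough illustration of the tradeoff, a walk-down that treats NotFound as "forked" could be written like this. `findForkHeight`, its callback parameters, and `errBlockNotFound` are assumptions standing in for the real fork-detection code and `ErrBlockNotFound`.

```go
package main

import "errors"

// errBlockNotFound stands in for the backend's ErrBlockNotFound.
var errBlockNotFound = errors.New("block not found")

// findForkHeight walks down from localBest and returns the highest height
// whose local hash still matches the backend. Everything above the returned
// height would be disconnected. It returns 0 if no height matches.
func findForkHeight(
	localBest uint64,
	localHash func(uint64) string,
	remoteHash func(uint64) (string, error),
) (uint64, error) {
	for h := localBest; h > 0; h-- {
		remote, err := remoteHash(h)
		if err != nil && !errors.Is(err, errBlockNotFound) {
			return 0, err // a real RPC failure is not evidence of a fork
		}
		// On NotFound the height is treated as forked: keep walking down
		// rather than stopping, so orphaned blocks do not stay connected.
		if err == nil && remote == localHash(h) {
			return h, nil
		}
	}
	return 0, nil
}
```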

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A non-positive rpc_timeout left BTC's http.Client with no timeout (a blocked
backend could hang a sync RPC, and thus shutdown, forever) and made EVM's
context.WithTimeout expire immediately. Clamp it to a finite default (15s, kept
above the 10s trace_timeout so block traces still finish) in the BTC and EVM
constructors; Tron's HTTP node clients inherit the clamped value.

Tron's Shutdown only tore down ZeroMQ, leaving the RPC client open so an
in-flight GetBlockHash / raw fetch / tip re-query could delay shutdown up to the
RPC timeout. Close it via the new exported EthereumRPC.CloseRPC, mirroring
EthereumRPC.Shutdown.
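
The clamp itself is small. The sketch below assumes `rpc_timeout` is configured as a whole number of seconds; the function and constant names are hypothetical, and only the 15s default comes from the commit message.

```go
package main

import "time"

// defaultRPCTimeout mirrors the 15s default named in the commit message.
const defaultRPCTimeout = 15 * time.Second

// clampRPCTimeout converts the configured rpc_timeout (assumed to be in
// seconds) into a duration, replacing non-positive values with the default.
// A zero http.Client.Timeout means "no timeout", and context.WithTimeout with
// a non-positive duration is already expired, so neither value is safe to
// pass through unchanged.
func clampRPCTimeout(seconds int) time.Duration {
	if seconds <= 0 {
		return defaultRPCTimeout
	}
	return time.Duration(seconds) * time.Second
}
```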

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
EVM already aborts in-flight RPCs on shutdown via closeRPC, but BTC-family and
Tron's HTTP node clients did not: a sync GetBlock/GetBlockHash issued just before
SIGTERM could hold up shutdown until the RPC timeout fired.

Give BitcoinRPC a per-client base context wired into every Call/callBatch request
(http.NewRequestWithContext) and cancel it in Shutdown; the transport also uses
DialContext so a blocked TCP connect is interrupted, not just an established
request. All BitcoinRPC sync paths funnel through Call, so this covers every
embedding coin with no per-coin change. Tron threads the same context into all
its HTTP node calls (block enrichment, tx-detail fallbacks, mempool, broadcast);
its rpc-client side is already covered by CloseRPC.

Coins with their own HTTP client off the shared seam (dcr, nuls) are not
prompt-cancelled and remain bounded by the clamped RPC timeout.
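
The base-context pattern can be sketched with the standard library alone. The `rpcClient` type, its fields, and the `rpcTimeout` value are illustrative assumptions; the commit only confirms that requests are built with `http.NewRequestWithContext`, that the transport uses `DialContext`, and that `Shutdown` cancels the context.

```go
package main

import (
	"bytes"
	"context"
	"net"
	"net/http"
	"time"
)

// rpcTimeout is a hypothetical per-request bound.
const rpcTimeout = 15 * time.Second

// rpcClient is an illustrative stand-in for a JSON-RPC HTTP client.
type rpcClient struct {
	url    string
	client *http.Client
	ctx    context.Context
	cancel context.CancelFunc
}

func newRPCClient(url string) *rpcClient {
	ctx, cancel := context.WithCancel(context.Background())
	dialer := &net.Dialer{Timeout: rpcTimeout}
	return &rpcClient{
		url: url,
		client: &http.Client{
			Timeout: rpcTimeout,
			// DialContext lets cancellation interrupt a blocked TCP connect.
			Transport: &http.Transport{DialContext: dialer.DialContext},
		},
		ctx:    ctx,
		cancel: cancel,
	}
}

// newRequest ties every request to the client's base context.
func (c *rpcClient) newRequest(body []byte) (*http.Request, error) {
	req, err := http.NewRequestWithContext(c.ctx, http.MethodPost, c.url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// call sends one request; it returns early with an error once Shutdown runs.
func (c *rpcClient) call(body []byte) (*http.Response, error) {
	req, err := c.newRequest(body)
	if err != nil {
		return nil, err
	}
	return c.client.Do(req)
}

// Shutdown cancels the base context, aborting all in-flight requests.
func (c *rpcClient) Shutdown() { c.cancel() }
```

Because every request shares one context, a single `cancel` covers all call sites without each caller threading its own context through.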

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On a fast/archive EVM chain (Avalanche 2s, Arbitrum 250ms) the resync loop keeps
connecting blocks without returning, so IsSynchronized stays false and the status
page reads Synchronized=false even when the index is at the tip.

GetSystemInfo now reports the externally observable state: an EVM index within one
block of the backend tip and freshly updated counts as synced. The staleness
window keys off the per-coin averageBlockTimeMs config (stable, sub-second-safe,
and the value the tip watchdog uses) instead of the runtime average, which reads 0
until enough block times are observed.
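
A sketch of the check described above. The function name and the staleness multiplier (`staleWindowBlocks`) are assumptions; the commit states only that the window keys off `averageBlockTimeMs` and that an index within one block of the tip counts as synced.

```go
package main

// staleWindowBlocks is a hypothetical multiplier: how many average block
// times may pass without an update before the index is considered stale.
const staleWindowBlocks = 30

// looksSynchronized reports the externally observable sync state: the index
// is within one block of the backend tip and was updated recently enough.
// All times are in milliseconds.
func looksSynchronized(indexHeight, backendHeight uint64, sinceUpdateMs, averageBlockTimeMs int64) bool {
	if indexHeight+1 < backendHeight {
		return false // more than one block behind the backend tip
	}
	// The window derives from the configured block time, which is available
	// immediately, unlike a runtime average that starts at 0.
	windowMs := averageBlockTimeMs * staleWindowBlocks
	return sinceUpdateMs <= windowMs
}
```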

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…watchdog

tipWatchdog is the sole healer for every silent-stall mode this branch
fixed, and reconnectRPC runs on that single goroutine. dialRPC dialed
with context.Background(), so a websocket backend that accepts the socket
but never completes the upgrade (a load-balancer blackhole — exactly the
stall the watchdog reconnects from) blocks the dial, and the watchdog,
forever. The cached tip then stays frozen, resyncIndex keeps reporting a
false syncNotNeeded, and sync silently stalls until a restart.

Dial under a bounded context (dialTimeout). go-ethereum uses it only for
the first handshake, so the established connection is unaffected.
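
The shape of a bounded dial, sketched without the go-ethereum dependency. `dialFunc`, `dialBounded`, and `blackholeDial` are hypothetical names; `blackholeDial` simulates a backend that accepts the socket but never finishes the handshake.

```go
package main

import (
	"context"
	"time"
)

// dialFunc is a hypothetical signature standing in for a context-aware dial.
type dialFunc func(ctx context.Context, url string) error

// dialBounded runs dial under a context that expires after timeout, so a
// handshake that never completes cannot park the calling goroutine.
func dialBounded(dial dialFunc, url string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return dial(ctx, url)
}

// blackholeDial never completes on its own; it returns only when the context
// is cancelled or times out.
func blackholeDial(ctx context.Context, url string) error {
	<-ctx.Done()
	return ctx.Err()
}
```

With `context.Background()` in place of the bounded context, `blackholeDial` would never return, which is the stall the commit describes.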

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The silent stall this branch targets left the index frozen with the process
healthy, so add the signals to catch it:

- watchdog_tick event on blockbook_backend_subscription_events (EVM + Tron):
  a heartbeat for the lone watchdog goroutine. rate==0 means it stopped
  ticking (a parked healer) — the one failure the other watchdog metrics
  cannot show, since they only fire on an already-detected stall.
- blockbook_synchronized gauge (0/1, every coin): mirrors /api/status inSync;
  a sustained 0 outside initial sync means the index is not keeping up.
- refresh blockbook_tip_age_seconds on every resync outcome (moved into
  updateBackendInfo), so it climbs during a silent stall instead of waiting
  on the ~15-minute app-info loop.

docs/sync.md documents the three stall-detection alert expressions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

8 participants