upstream changes#4
Open
mikehash wants to merge 990 commits into
…placing bytes directly
…riodically when syncing
#1527) * feat(ethrpc): observe JSON-RPC calls for block sync and emit error requests * refactor(ethrpc): naming of the observation method
clear() does not shrink a Go map's underlying bucket allocation, so a single very large block would leave hits at its high-water mark even after reset. Reinitialize with the maxPendingHits pre-size instead so the runtime can release the oversized allocation.
…borts
Sync could block forever while queueing block hashes if all workers were stuck retrying "block not found", because the coordinator stopped reading worker aborts.
Stop EVM sync from silently stalling when a backend's tip feed dies. Adds subscription watchdogs (Tron poll-only) plus metrics.
…olygon-ecosystem-token"
Arm lastSubNotifyNs at subscribe time, not only on the first tip advance. Liveness is otherwise stamped only when a header advances the tip, so a subscription that never delivers a usable header leaves it at 0 and keeps tipWatchdog's lastNs == 0 gate closed forever: the cached tip never refreshes and resyncIndex reports a silent syncNotNeeded. Seeding here lets a stalled feed age past the threshold so the watchdog polls and reconnects.
The monotonic tip guard rejects every newHeads header below the cached tip. That correctly rides out a transient lagging load-balancer node, but on a genuine backend rollback to a lower height the feed delivers only sub-tip heads, all rejected, so the cached tip freezes above the backend. The frozen tip then equals the local DB tip, so resyncIndex early-exits as syncNotNeeded and never reaches its GetBlockHash(localBestHeight) fork path — a silent stall with no error and no metric, the same class of bug the watchdog set out to kill. The watchdog already fires here: rejected headers don't refresh liveness, so lastSubNotifyNs ages past TipStaleThreshold. Let its post-stall poll regress the cached tip — refreshBestHeaderFromChain gains an allowRegress flag (hot path stays monotonic, the watchdog passes true). After the full stall window a still-lower backend tip is a real rollback, so the tip follows the backend down and the next resyncIndex reaches its fork/disconnect path. A fluke lower poll cannot corrupt state: resyncIndex re-confirms via an independent GetBlockHash, and a chain still at the higher tip simply re-advances on the next header. Emit watchdog_tip_rollback to distinguish it from a forward advance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
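The guard and its watchdog escape hatch can be sketched as follows. `tipCache` and `update` are hypothetical names; only the `allowRegress` flag and its two callers (hot path passes false, watchdog passes true) come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// tipCache is a hypothetical stand-in for the cached best-header state.
type tipCache struct {
	mu     sync.Mutex
	height uint64
}

// update applies an observed backend height and reports whether the cached
// tip changed. With allowRegress=false the tip is monotonic, so a lagging
// load-balancer node cannot pull it down. With allowRegress=true (the
// watchdog's post-stall poll) a lower height is accepted as a rollback.
func (c *tipCache) update(height uint64, allowRegress bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if height == c.height {
		return false
	}
	if height < c.height && !allowRegress {
		return false
	}
	c.height = height
	return true
}

func main() {
	c := &tipCache{height: 100}
	fmt.Println(c.update(98, false)) // hot path: sub-tip header rejected
	fmt.Println(c.update(98, true))  // watchdog poll after the stall window: tip follows the backend down
	fmt.Println(c.update(101, false)) // normal forward advance
}
```

Keeping the flag on the shared function, rather than adding a second setter, means the hot path and the watchdog cannot drift apart in how they lock and store the tip.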
EthereumRPC.bestHeaderTime was the timestamp the old getBestHeader used to trigger its 15-minute passive reconnect. That reconnect moved to tipWatchdog (keyed on lastSubNotifyNs), removing the field's only reader and leaving it written in two places but never read. Drop it so the struct no longer implies a tip-staleness clock that nothing consults. Tron keeps its own, separate bestHeaderTime (it cannot reach this unexported field) and is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tipWatchdogTick re-armed liveness (markNotifyAlive) after its own tip poll. On a permanently silent ZeroMQ feed while the chain keeps producing, each poll advanced the tip and reset the clock, so blockbook_backend_subscription_age_seconds sawtoothed up to the threshold and back instead of climbing past it — the dead feed stayed invisible to any age-based alert, the very failure mode the watchdog exists to surface. Drop the re-arm so liveness is refreshed only by a real ZeroMQ delivery (newBlockNotifier), matching the EVM watchdog's invariant that its own poll never counts as feed health. Polling every sample interval while the feed is silent is the intended recovery, not a cost: Tron's seconds-apart blocks mean reaching the threshold is already an abnormal gap, and the poll keeps sync moving until ZeroMQ delivery resumes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tron's setBestHeader accepts a lower height where EthereumRPC's monotonic guard rejects one. This is correct for Tron — its tip is always an HTTP re-query, not a feed header, so following the backend down is what surfaces a rollback and avoids the frozen-tip masking the EVM guard otherwise introduces. The flip side is that a load-balanced Tron RPC with a lagging node could regress the tip and trip a spurious fork (the case the EVM guard prevents), which is fine for the usual single-node java-tron backend. Document the choice and its boundary on setBestHeader so the asymmetry with EVM is deliberate and visible, and point at the EVM pattern to port if Tron is ever fronted by a load balancer. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fork walk-down treats an ErrBlockNotFound from the backend (remote == "") as "this height is forked, disconnect it". That is required to heal a real lower-height rollback, where the orphaned blocks genuinely no longer exist on the backend. The cost is that a load-balanced backend whose lagging node answers NotFound for a still-canonical block can over-disconnect; it is bounded and self-healing because the following resyncIndex re-connects those blocks. The naive alternative, stopping on NotFound, would be worse, leaving genuinely orphaned blocks connected and the index wedged ahead of the backend after a real rollback. Expand the existing one-line note into this rationale so the behavior reads as a deliberate tradeoff rather than an oversight. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A non-positive rpc_timeout left BTC's http.Client with no timeout (a blocked backend could hang a sync RPC, and thus shutdown, forever) and made EVM's context.WithTimeout expire immediately. Clamp it to a finite default (15s, kept above the 10s trace_timeout so block traces still finish) in the BTC and EVM constructors; Tron's HTTP node clients inherit the clamped value. Tron's Shutdown only tore down ZeroMQ, leaving the RPC client open so an in-flight GetBlockHash / raw fetch / tip re-query could delay shutdown up to the RPC timeout. Close it via the new exported EthereumRPC.CloseRPC, mirroring EthereumRPC.Shutdown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
EVM already aborts in-flight RPCs on shutdown via closeRPC, but BTC-family and Tron's HTTP node clients did not: a sync GetBlock/GetBlockHash issued just before SIGTERM could hold up shutdown until the RPC timeout fired. Give BitcoinRPC a per-client base context wired into every Call/callBatch request (http.NewRequestWithContext) and cancel it in Shutdown; the transport also uses DialContext so a blocked TCP connect is interrupted, not just an established request. All BitcoinRPC sync paths funnel through Call, so this covers every embedding coin with no per-coin change. Tron threads the same context into all its HTTP node calls (block enrichment, tx-detail fallbacks, mempool, broadcast); its rpc-client side is already covered by CloseRPC. Coins with their own HTTP client off the shared seam (dcr, nuls) are not prompt-cancelled and remain bounded by the clamped RPC timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On a fast/archive EVM chain (Avalanche 2s, Arbitrum 250ms) the resync loop keeps connecting blocks without returning, so IsSynchronized stays false and the status page reads Synchronized=false even when the index is at the tip. GetSystemInfo now reports the externally observable state: an EVM index within one block of the backend tip and freshly updated counts as synced. The staleness window keys off the per-coin averageBlockTimeMs config (stable, sub-second-safe, and the value the tip watchdog uses) instead of the runtime average, which reads 0 until enough block times are observed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…watchdog
tipWatchdog is the sole healer for every silent-stall mode this branch fixed, and reconnectRPC runs on that single goroutine. dialRPC dialed with context.Background(), so a websocket backend that accepts the socket but never completes the upgrade (a load-balancer blackhole, exactly the stall the watchdog reconnects from) blocks the dial, and the watchdog, forever. The cached tip then stays frozen, resyncIndex keeps reporting a false syncNotNeeded, and sync silently stalls until a restart. Dial under a bounded context (dialTimeout). go-ethereum uses it only for the first handshake, so the established connection is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The silent stall this branch targets left the index frozen with the process healthy, so add the signals to catch it:
- watchdog_tick event on blockbook_backend_subscription_events (EVM + Tron): a heartbeat for the lone watchdog goroutine. rate==0 means it stopped ticking (a parked healer), the one failure the other watchdog metrics cannot show, since they only fire on an already-detected stall.
- blockbook_synchronized gauge (0/1, every coin): mirrors /api/status inSync; a sustained 0 outside initial sync means the index is not keeping up.
- refresh blockbook_tip_age_seconds on every resync outcome (moved into updateBackendInfo), so it climbs during a silent stall instead of waiting on the ~15-minute app-info loop.

docs/sync.md documents the three stall-detection alert expressions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No description provided.