
Backend reliability hardening: resumable Teleporter, metric-service dedup, health/shutdown, dead-code cleanup #97

Merged: muhammetselimfe merged 3 commits into main from backend-reliability-cleanup on May 31, 2026
@muhammetselimfe
Collaborator

Summary

A batch of low-risk reliability and maintainability fixes for the backend. Net −1,624 lines (large dead-code/duplication removal) with +19 tests.
Full suite: 208 passing.

No behavior changes to public API responses: the metric refactor preserves every existing method name, and the Teleporter weekly result is identical
to that of the old single-pass aggregation.

What's in here

1. Resumable Teleporter weekly update

The ~10h, 7-day ICM fetch previously ran as a single in-memory pass — a redeploy or crash mid-run lost all progress and restarted from scratch on the
next cron tick. It now fetches day-by-day, checkpointing each completed day to partialResults (anchored to a stored referenceEndTime so resuming
hours later still yields a consistent 7-day window). After a restart, the next run resumes from the next unfetched day. The existing
lock/heartbeat/ownership machinery is preserved.
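
The day-by-day checkpointing described above could look roughly like the sketch below. It is not the PR's code: only partialResults and referenceEndTime come from the description, while `store`, `fetchDay`, `buildDayWindows`, and `runWeeklyUpdate` are hypothetical names, and the lock/heartbeat machinery is left out.

```javascript
// Hypothetical sketch of a resumable 7-day fetch. Only `partialResults` and
// `referenceEndTime` are names from the PR; everything else is illustrative.
const DAY_MS = 24 * 60 * 60 * 1000;

// Build the day-windows backwards from a fixed reference end time, oldest first.
function buildDayWindows(referenceEndTime, days = 7) {
  const windows = [];
  for (let i = days; i >= 1; i--) {
    windows.push({
      day: days - i + 1,
      startTime: referenceEndTime - i * DAY_MS,
      endTime: referenceEndTime - (i - 1) * DAY_MS,
    });
  }
  return windows;
}

// `store` is an assumed persistence interface: load() -> state or null, save(state).
// `fetchDay` is an assumed async function returning the result for one window.
async function runWeeklyUpdate(store, fetchDay, now = Date.now()) {
  // Reuse the stored reference time on resume so the 7-day window does not drift.
  const state = (await store.load()) || { referenceEndTime: now, partialResults: [] };
  const done = new Set(state.partialResults.map((r) => r.day));

  for (const window of buildDayWindows(state.referenceEndTime)) {
    if (done.has(window.day)) continue; // already checkpointed by an earlier run
    const result = await fetchDay(window);
    state.partialResults.push({ day: window.day, result });
    await store.save(state); // checkpoint after each completed day
  }
  return state.partialResults;
}
```

If `fetchDay` throws on day 4, days 1 to 3 are already saved, so the next call starts at day 4 and keeps the original referenceEndTime.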

2. Deduplicate metric services

Six near-identical ~307-line services (activeAddresses, txCount, gasUsed, avgGasPrice, maxTps, feesPaid) collapsed into a single
createMetricService factory + thin wrappers (~1,370 fewer lines). Public method names are preserved, so no route/cron caller changed. Network
sum-vs-avg aggregation is parameterized (only avgGasPrice uses avg).
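
As a rough illustration of the factory-plus-wrappers shape, the sketch below parameterizes only the network aggregation. Only createMetricService and the metric names come from the PR: the option names (`metricName`, `networkAggregation`) and the `aggregateNetwork` method are assumptions, and the real factory also covers fetching, retries, and storage.

```javascript
// Illustrative sketch only: option names and the returned method are assumed,
// not the PR's actual createMetricService signature.
function createMetricService({ metricName, networkAggregation = 'sum' }) {
  if (networkAggregation !== 'sum' && networkAggregation !== 'avg') {
    throw new Error('Unknown aggregation: ' + networkAggregation);
  }
  return {
    metricName,
    // Combine per-chain values into a single network-wide value.
    aggregateNetwork(chainValues) {
      if (chainValues.length === 0) return 0;
      const total = chainValues.reduce((acc, value) => acc + value, 0);
      return networkAggregation === 'avg' ? total / chainValues.length : total;
    },
  };
}

// Thin wrappers: each former service reduces to a small configuration object.
const txCountService = createMetricService({ metricName: 'txCount' });
const avgGasPriceService = createMetricService({
  metricName: 'avgGasPrice',
  networkAggregation: 'avg', // the only metric that averages across chains
});
```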

3. Graceful shutdown + health checks

  • SIGTERM/SIGINT now drain the HTTP server and close the Mongo connection, with a 10s force-exit backstop.
  • /health (liveness, always 200 + uptime/Mongo info), /health/ready (readiness, 503 when Mongo is down), /health/dependencies (active
    Glacier/Metrics probe, cached for 30s so it can't be abused to burn the upstream rate budget).
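
A minimal sketch of the drain-then-close sequence with a force-exit backstop, under stated assumptions: `installGracefulShutdown` is a hypothetical helper, `server` is anything exposing close(callback) like a Node http.Server, and `closeDb` stands in for whatever closes the Mongo connection. The PR's actual wiring may differ.

```javascript
// Sketch with injected dependencies; all names here are hypothetical.
// `proc` defaults to the real process but can be replaced in tests.
function installGracefulShutdown({ server, closeDb, proc = process, timeoutMs = 10000 }) {
  let shuttingDown = false;

  async function shutdown() {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // Backstop: force-exit if draining or closing hangs past the timeout.
    const timer = setTimeout(() => proc.exit(1), timeoutMs);
    timer.unref(); // the backstop itself should not keep the process alive

    try {
      // Stop accepting new connections and wait for in-flight requests.
      await new Promise((resolve, reject) => {
        server.close((err) => (err ? reject(err) : resolve()));
      });
      await closeDb();
      clearTimeout(timer);
      proc.exit(0);
    } catch (err) {
      clearTimeout(timer);
      proc.exit(1);
    }
  }

  proc.on('SIGTERM', shutdown);
  proc.on('SIGINT', shutdown);
  return shutdown;
}
```

Injecting `proc` keeps the sequence testable without sending real signals or exiting the test runner.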

4. Dead code & deployment cleanup

  • Deleted unused fetchGlacierData.js, chainDataService.js, and a stale tracked teleporterService 2.js.
  • Production is a long-lived DigitalOcean process (Vercel fully retired), so removed vercel.json, the vercel-build script, and the isVercel
    branches; updated the README deployment docs.
  • Scrubbed phantom TVL/DefiLlama references from the README.

Review fixes folded in

This branch also addresses findings from a self-review of the above:

  • Teleporter day-windows use a half-open interval [startTime, endTime) so a message on a shared 24h boundary is counted in exactly one window (no
    double-count).
  • metricService.updateData returns an explicit { success: false } instead of undefined when the API returns a non-array payload on all retries
    (previously crashed updateAllChains).
  • /health/dependencies treats an unconfigured base URL as degraded rather than healthy.
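
To make the half-open rule concrete, here is a small sketch. `inWindow` and the sample windows are hypothetical, and seconds are used as the time unit purely for the example.

```javascript
// Half-open membership test: startTime is inclusive, endTime is exclusive.
function inWindow(timestamp, { startTime, endTime }) {
  return timestamp >= startTime && timestamp < endTime;
}

const DAY = 86400; // seconds; the unit is an assumption for this example
const day1 = { startTime: 0, endTime: DAY };
const day2 = { startTime: DAY, endTime: 2 * DAY };
const boundaryMessage = { timestamp: DAY }; // lands exactly on the shared boundary

// With a closed interval [startTime, endTime] both windows would match.
const claims = [day1, day2].filter((w) => inWindow(boundaryMessage.timestamp, w));
console.log(claims.length); // 1 (only day2 claims the message)
```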

Testing

  • 208 tests pass (npm test).
  • New tests cover: health endpoints, the metric factory's sum/avg aggregation and its retry-failure return, the Teleporter resume flow (proves days 1–3
    aren't refetched after a simulated day-4 crash), and the day-window boundary assignment.

muhammetselimfe and others added 3 commits May 29, 2026 18:58
Five reliability/maintainability fixes:

1. Remove dead code & config drift
   - Delete unused fetchGlacierData.js and chainDataService.js
   - Scrub phantom TVL/DefiLlama references from README

2. Graceful shutdown + dependency health checks
   - /health (liveness), /health/ready (readiness; 503 when Mongo down),
     /health/dependencies (active Glacier/Metrics probe, off the hot path)
   - SIGTERM/SIGINT drain the HTTP server and close Mongo, with a 10s
     force-exit backstop

3. Deduplicate metric services
   - Collapse 6 near-identical services (activeAddresses, txCount, gasUsed,
     avgGasPrice, maxTps, feesPaid) into a createMetricService factory + thin
     wrappers (~1,370 fewer lines). Public method names preserved (no caller
     churn); network sum-vs-avg aggregation parameterized.

4. Resumable Teleporter weekly update
   - Split the ~10h 7-day fetch into per-day windows checkpointed to
     partialResults, anchored to a stored referenceEndTime. A restart/crash
     now resumes from the next unfetched day instead of starting over. Final
     merge is identical to the old single-pass aggregation.

5. Resolve Vercel/cron deployment fit
   - Production is a long-lived DigitalOcean process (node-cron reliable).
     Remove stale vercel.json, vercel-build script, and isVercel branches;
     document the real deployment in README.

Tests: +11 (health endpoints, metric factory aggregation, Teleporter resume
proving days 1-3 are not refetched after a day-4 crash). Full suite: 206 pass.

Also includes a pre-existing working-tree change: nodemonConfig ignore rules
in package.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Teleporter weekly: use a half-open window [startTime, endTime) when filtering
  messages so adjacent day-windows (which share a boundary timestamp) no longer
  both claim a message landing exactly on the boundary, which double-counted it
  after merge. Also corrects the misleading "no-op" comment on the daily path.
- /health/dependencies: cache the deep probe for 30s so the endpoint can't be
  used to hammer Glacier/Metrics and burn the rate budget the cron jobs rely on;
  treat an unconfigured base URL as degraded (not healthy), since those env vars
  are required.
- Delete stale tracked duplicate src/services/teleporterService 2.js (old
  non-resumable logic, unreferenced).
- Add a regression test proving a boundary-timestamp message is assigned to
  exactly one window.

Full suite: 207 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
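
The 30s probe cache and the "unconfigured means degraded" rule from the commit above might be sketched as follows. `createCachedProbe`, `probe`, and the response shape are hypothetical, and concurrent callers arriving during a cache miss are not coalesced in this simplified version.

```javascript
// Hypothetical sketch. `probe` is an assumed async function that performs the
// real upstream check and throws on failure; `now` is injectable for testing.
function createCachedProbe({ baseUrl, probe, ttlMs = 30000, now = Date.now }) {
  let cached = null; // { at, result }

  return async function check() {
    // A missing required base URL is reported as degraded, never healthy.
    if (!baseUrl) {
      return { status: 'degraded', reason: 'base URL not configured' };
    }
    // Serve the cached result while it is fresh so callers cannot hammer upstream.
    if (cached && now() - cached.at < ttlMs) {
      return { ...cached.result, cached: true };
    }
    let result;
    try {
      await probe(baseUrl);
      result = { status: 'healthy' };
    } catch (err) {
      result = { status: 'degraded', reason: err.message };
    }
    cached = { at: now(), result };
    return { ...result, cached: false };
  };
}
```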
…ponse

updateData fell off the retry loop and resolved to undefined when the Metrics
API kept returning a 200 with a non-array `results` payload (each attempt hit
the `continue`). updateAllChains then crashed on `undefined.success`. Return an
explicit { success:false } result instead, with a regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
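
The fall-through described in the commit above can be reproduced in a few lines. This is a simplified sketch rather than the real updateData: `fetchMetrics`, `maxRetries`, and the success payload are assumed for illustration.

```javascript
// Simplified sketch of the retry loop; `fetchMetrics` and `maxRetries` are assumed names.
async function updateData(fetchMetrics, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const response = await fetchMetrics();
    if (!response || !Array.isArray(response.results)) {
      continue; // a 200 with a malformed payload: try again
    }
    return { success: true, count: response.results.length };
  }
  // Without this return, the function resolves to undefined once the retries
  // run out, and a caller reading `.success` throws a TypeError.
  return { success: false, error: 'non-array results payload on all retries' };
}
```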
@muhammetselimfe muhammetselimfe merged commit 4ccb020 into main May 31, 2026
4 of 5 checks passed