
feat: opt-in mobile client-RX coverage (crowdsourced RF reach) + /api/nodes/resolve #1728

Open
efiten wants to merge 41 commits into Kpa-clawbot:master from efiten:feat/client-rx-coverage-pr

Conversation

@efiten

@efiten efiten commented Jun 15, 2026

Collaborator

Implements #1727.

What this adds

Mobile client-RX coverage — an opt-in, crowdsourced RF-coverage feature. A roaming MeshCore companion radio (driven by the open-source corescope-rx PWA, GPLv3) reports which nodes it heard directly, tagged with the phone's GPS and the packet's SNR/RSSI. CoreScope ingests these into a new client_receptions table and renders per-node hex coverage on the Reach page, plus a standalone Coverage dashboard (#/rx-coverage) with a top-mobile-observers leaderboard.

Also includes GET /api/nodes/resolve?prefix=<hex> — a read-only node-name lookup by pubkey prefix ({name, pubkey, ambiguous}), used by the companion app for friendly names.
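The lookup semantics can be sketched in a few lines of Go. This is an in-memory illustration only: the Node type, the resolvePrefix function and the sample data are hypothetical, and the real handler in node_resolve.go queries the nodes table instead.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for one row of the nodes table.
type Node struct {
	Name   string
	Pubkey string // lowercase hex
}

// ResolveResult mirrors the {name, pubkey, ambiguous} response shape.
type ResolveResult struct {
	Name      string `json:"name"`
	Pubkey    string `json:"pubkey"`
	Ambiguous bool   `json:"ambiguous"`
}

// resolvePrefix returns the single node whose pubkey starts with prefix.
// found is false when nothing matches. When several nodes match, the
// result carries Ambiguous=true and no name, so a caller never gets a
// wrong friendly name for a colliding prefix.
func resolvePrefix(nodes []Node, prefix string) (res ResolveResult, found bool) {
	prefix = strings.ToLower(prefix)
	if prefix == "" {
		return ResolveResult{}, false
	}
	matches := 0
	for _, n := range nodes {
		if strings.HasPrefix(n.Pubkey, prefix) {
			matches++
			res = ResolveResult{Name: n.Name, Pubkey: n.Pubkey}
		}
	}
	switch {
	case matches == 0:
		return ResolveResult{}, false
	case matches > 1:
		return ResolveResult{Ambiguous: true}, true
	}
	return res, true
}

func main() {
	nodes := []Node{
		{Name: "Alice", Pubkey: "aabbcc01"},
		{Name: "Bob", Pubkey: "aabbcc02"},
		{Name: "Carol", Pubkey: "11223344"},
	}
	fmt.Println(resolvePrefix(nodes, "1122"))   // unique match
	fmt.Println(resolvePrefix(nodes, "aabbcc")) // two candidates
}
```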

Opt-in — default OFF (zero impact on existing deployments)

The whole feature is gated behind one config flag, disabled by default:

"clientRxCoverage": { "enabled": false }

When disabled (the default): the ingestor writes no client_receptions; the three coverage endpoints return a clean 404; the UI hides the Coverage nav link, the #/rx-coverage route, and the Reach-page toggle. /api/nodes/resolve is always available (not coverage-specific).
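A minimal sketch of a default-off gate, assuming a pointer-typed config section so that an absent block means "disabled". ClientRxCoverageEnabled is the method name the reviews below refer to; the Config layout, the gated wrapper and statusFor are illustrative assumptions, not the PR's code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ClientRxCoverageConfig is the assumed shape of the "clientRxCoverage" block.
type ClientRxCoverageConfig struct {
	Enabled bool `json:"enabled"`
}

// Config holds the section as a pointer: a missing block stays nil.
type Config struct {
	ClientRxCoverage *ClientRxCoverageConfig `json:"clientRxCoverage"`
}

// ClientRxCoverageEnabled is nil-safe at every level, so a nil config,
// a missing section and enabled=false all read as "off".
func (c *Config) ClientRxCoverageEnabled() bool {
	return c != nil && c.ClientRxCoverage != nil && c.ClientRxCoverage.Enabled
}

// gated wraps a handler so that it answers 404 unless the flag is on.
func gated(cfg *Config, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !cfg.ClientRxCoverageEnabled() {
			http.NotFound(w, r)
			return
		}
		next(w, r)
	}
}

// statusFor sends one GET through the gated handler and returns the status.
func statusFor(cfg *Config) int {
	h := gated(cfg, func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/api/rx-coverage", nil))
	return rec.Code
}

func main() {
	on := &Config{ClientRxCoverage: &ClientRxCoverageConfig{Enabled: true}}
	fmt.Println(statusFor(nil), statusFor(&Config{}), statusFor(on))
}
```

Keeping the nil checks inside the method means call sites never need their own guard, which is what makes the fail-closed default hard to break by accident.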

How it works

companion ──BLE 0x88 (snr+rssi+raw)──▶ corescope-rx PWA ──▶ MQTT meshcore/client/{pubkey}/packets
                                                                      │
                                          ingestor (gated) ──▶ client_receptions (GPS + SNR + heard-key)
                                                                      │
              server: pure-Go hex grid ──▶ GeoJSON ──▶ Reach hex overlay + Coverage dashboard
  • Direct-only capture: records only what the companion heard itself and directly — a 0-hop advert's pubkey, or path[last] (last forwarder) for FLOOD routes; ≥2-byte path-hash required. Upstream hops discarded.
  • No new deps: hexbins are a pure-Go pointy-top grid over Web Mercator (cmd/server/hexgrid.go) computed at query time (CGO_ENABLED=0 / modernc.org/sqlite friendly); frontend uses the existing Leaflet.
  • Trust: companion pubkey = identity; an EMQX ACL binds each client to publish only to its own meshcore/client/{pubkey}/packets topic. Payload contract in docs/client-rx-coverage.md.
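The direct-only rule from the first bullet can be sketched as follows. The Packet type and heardKey function are hypothetical simplifications written for this illustration; the PR's own logic lives in deriveHeardKey and works on the real decoded packet.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// RouteType is a simplified routing classification (assumption).
type RouteType int

const (
	RouteFlood RouteType = iota
	RouteDirect
)

// Packet is a hypothetical, stripped-down decoded packet.
type Packet struct {
	Direction string // "rx" or "tx"
	Route     RouteType
	Path      [][]byte // path hashes, oldest hop first
	IsAdvert  bool
	PubKey    []byte // advertiser's key, set on adverts only
}

var errNotDirect = errors.New("not attributable to a directly heard node")

// heardKey returns the hex key of the node the companion heard itself,
// plus the key length in bytes.
func heardKey(p Packet) (string, int, error) {
	if p.Direction != "rx" || p.Route != RouteFlood {
		return "", 0, errNotDirect
	}
	if len(p.Path) == 0 {
		// Zero hops: only an advert identifies its sender.
		if p.IsAdvert && len(p.PubKey) == 32 {
			return hex.EncodeToString(p.PubKey), 32, nil
		}
		return "", 0, errNotDirect
	}
	// The last forwarder is the only hop heard directly; upstream hops are dropped.
	last := p.Path[len(p.Path)-1]
	if len(last) < 2 {
		return "", 0, errNotDirect // 1-byte hashes collide too easily
	}
	return hex.EncodeToString(last), len(last), nil
}

func main() {
	p := Packet{Direction: "rx", Route: RouteFlood, Path: [][]byte{{0x01, 0x02}, {0xab, 0xcd}}}
	key, n, err := heardKey(p)
	fmt.Println(key, n, err) // prints "abcd 2 <nil>"
}
```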

How to enable / try it

  1. In config.json, set "clientRxCoverage": { "enabled": true } and restart server + ingestor.
  2. Configure a listener on EMQX (or any other MQTT broker) so that a client can publish to meshcore/client/<pubkey>/packets; the ingestor already subscribes under meshcore/#.
  3. Run the corescope-rx PWA on an Android phone paired (BLE) to a MeshCore companion — it captures heard nodes + GPS and publishes.
  4. View results: per-node Reach page → toggle coverage, or the Coverage dashboard at #/rx-coverage.

What's where

  • Ingestor: cmd/ingestor/client_reception.go (ingest), db.go (client_receptions + client_observers schema), main.go (gated dispatch), config.go (flag).
  • Server: cmd/server/rx_coverage.go + rx_dashboard.go (endpoints, self-guard 404 when off), hexgrid.go (pure-Go grid), node_resolve.go (resolve), routes.go / types.go / config.go (wiring + flag + /api/config/client field).
  • Frontend: public/rx-coverage.js (dashboard), node-reach-coverage.js + .css (overlay), node-reach.js (Reach toggle, flag-gated), roles.js (reads the flag, hides nav when off).
  • Docs: docs/client-rx-coverage.md.

Testing

  • Go: cd cmd/server && go test ./... and cd cmd/ingestor && go test ./... — green, including new gate tests (coverage_gate_test.go in both: off → no rows / 404, on → works) and the rx-coverage / resolve / hexgrid suites.
  • JS: node test-coverage-gate.js, node test-node-reach-coverage.js (wired into CI). The Playwright test-node-reach-coverage-e2e.js is wired into the e2e job and skips when clientRxCoverage is disabled, so it's safe under the default-off config.

Notes for reviewers

  • The four new routes are registered in cmd/server/openapi_known_gaps.json (the existing OpenAPI-completeness ratchet), matching how other not-yet-spec'd routes are tracked. Happy to write full OpenAPI spec entries instead if you prefer.
  • Commits are split per layer (ingestor / server endpoints / resolve / frontend / CI) for review.

efiten and others added 11 commits June 14, 2026 23:47
…ated)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… gate

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…default-off)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rescope-rx)

The PR was missing docs/client-rx-coverage.md (the MQTT payload contract) and gave
operators/users no pointer to the mobile capture app. Add the doc with a 'Companion
app' section + operator enable steps, link corescope-rx from the config.example.json
flag comment, and add a 'Get the companion app' link on the Coverage dashboard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… it in static nav)

The static nav link broke upstream's nav-overflow e2e (test-nav-priority-1391):
it counts all .nav-link elements regardless of display, so a hidden opt-in link
still failed the expected-nav-set assertion. Remove the link from index.html and
inject it from roles.js after Analytics only when clientRxCoverage is enabled,
nudging applyNavPriority via a resize event. Default-off nav now exactly matches
upstream (deterministic CI), and the link appears when the feature is on.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Kpa-clawbot

Owner

Kent Beck Gate (round 1) — TDD + test quality

Verdict: APPROVED (with TDD-history caveat noted under net-new-feature exemption).


TDD history

PR commit log on origin/master..pr-1728-ro (first three non-merge commits):

| # | SHA | Subject | Files |
|---|-----|---------|-------|
| 1 | 933ab3da | feat(coverage): frontend — Coverage dashboard + Reach overlay (flag-gated) | 5 prod (public/*.{js,css,html}) + 3 tests (test-*.js) |
| 2 | 1af0f16b | feat(coverage): ingestor — client_receptions schema + ingest + opt-in gate | 5 prod (cmd/ingestor/*.go) + 2 tests |
| 3 | a5c6a9c4 | feat(coverage): server — GeoJSON hex coverage endpoints + opt-in gate | 8 prod + 5 tests + openapi/config |

No pure red commit precedes any of the three feature commits — production and tests land together per layer.

Strict red-then-green: FAIL.
Net-new-feature exemption (AGENTS.md): APPLIES. This is a brand-new opt-in feature surface (client_receptions table, /api/rx-coverage*, /api/nodes/{pk}/rx-coverage, frontend Coverage dashboard, default-off behind clientRxCoverage). There are no pre-existing tests to break and nothing to regress on by definition. AGENTS.md grants the limited exemption iff the tests land in the same PR and assert real behavior rather than stubs.

I verified both halves of the exemption (see test-quality below). Per-commit CI history is no longer queryable on individual SHAs (gh run list --commit <sha> returns [] for all of 933ab3da/1af0f16b/a5c6a9c4/...), so I cannot produce a "first commit was red, second was green" CI proof — that history is gone. Current gh pr checks 1728: Go Build & Test pass (11m31s), Playwright E2E pass (15m7s), Docker build pass.

Recommendation for next time on net-new layers: split each layer into two commits — add failing test, then make it pass — so the discipline is auditable on the branch even when CI retention drops.


Test-quality interrogation (anti-tautology + six questions)

cmd/ingestor/coverage_gate_test.go — NOT a tautology.
TestClientRxCoverageGateOff / GateOn build a real mockMessage on meshcore/client/<pk>/packets with a valid relayed-advert raw hex + GPS, drive the actual handleMessage with cfg.ClientRxCoverage nil vs. enabled, then assert SELECT COUNT(*) FROM client_receptions is 0 vs 1. Reverting the cfg.ClientRxCoverageEnabled() branch in main.go:541 flips the assertion. Reverting the handleClientPacket → InsertClientReception chain also breaks GateOn. The "ON" test would catch any silent drop in the dispatch path; the "OFF" test would catch a gate that was wired to the wrong predicate. Setup is the helper clientCoverageMsg() (1 line per test) — proportional to the assertion. ✅

cmd/server/coverage_gate_test.go — NOT a tautology.
Builds a real *mux.Router via srv.RegisterRoutes(router) and asserts: (a) /api/rx-coverage returns 404 when disabled, (b) /api/config/client body contains "clientRxCoverage":false when disabled / :true when enabled, (c) the route is registered (≠ 404) when enabled. Reverting the route-registration gate would flip (a); reverting the config exposure flips (b). Six Qs: name describes behavior (TestCoverageRoutesGatedOff); a wrong impl that always-registers would fail (a); a wrong impl that hardcodes false in config/client would fail (b). ✅

cmd/server/rx_coverage_test.go — strong behavior coverage.

  • TestAggregateCoverageBucketsBestSNR — two rows in the same hex cell, asserts max-SNR wins, count=2, HasSig true, polygon geometry. A wrong impl that took last SNR (-12) would fail.
  • TestAggregateCoverageGreyWhenNoSignal — single row, no SNR, asserts cell renders as no-sig.
  • TestAggregateCoverageNodeBreakdown — three nodes, asserts strongest-first sort, latest-by-RxAt SNR per node, count aggregation, no-sig node sorted last. Catches any wrong sort or "first SNR wins" regression.
  • TestAggregateCoverageMergesResolvedNodes — same node under prefix + full pubkey, asserts merge to 1 entry, count=3, latest SNR (-9). This is the test that justifies the resolver lambda existing — without it the test fails with 2 nodes.
  • TestResolveHeardKey — unique / ambiguous / unknown / empty prefix cases. Reverting the ambiguity check returns Alice for aabbcc — fails immediately.
  • TestHexSizeRendersConstantPx — for res 4..16, asserts on-screen pixel size is constant ~hexTargetPx and halves per zoom step. This is a genuine math invariant — a wrong constant or off-by-one in mercUPPZ0 math fails.
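The invariant behind TestHexSizeRendersConstantPx can be written down directly. The sketch below assumes 256-pixel Web Mercator tiles and the hexTargetPx=28 value quoted later in this thread; unitsPerPixel and hexSize are illustrative names, not the functions in hexgrid.go.

```go
package main

import (
	"fmt"
	"math"
)

const (
	earthRadiusM = 6378137.0 // Web Mercator sphere radius in meters
	tilePx       = 256.0     // assumed tile size
	targetPx     = 28.0      // assumed on-screen hex size in pixels
)

// unitsPerPixel returns Web Mercator meters per screen pixel at zoom z
// (measured at the equator). It halves with every zoom step.
func unitsPerPixel(z int) float64 {
	return 2 * math.Pi * earthRadiusM / math.Ldexp(tilePx, z)
}

// hexSize returns the hex size in Mercator meters for zoom z, chosen so
// that the hex always spans targetPx pixels on screen.
func hexSize(z int) float64 {
	return targetPx * unitsPerPixel(z)
}

func main() {
	for z := 4; z <= 16; z += 4 {
		fmt.Printf("z=%d size=%.1f m on-screen=%.1f px\n",
			z, hexSize(z), hexSize(z)/unitsPerPixel(z))
	}
}
```

Because the ground size is derived from the per-zoom pixel scale, the two properties the test asserts (constant pixel size, halving per zoom step) follow from one formula rather than from a lookup table.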

None of these read like "test the stub returns what the stub returns." Every test would catch a plausible bug.

cmd/ingestor/client_reception_test.go — solid.
TestDeriveHeardKey covers the routing-type matrix (Flood OK, Direct rejected, TransportDirect rejected, 1-byte rejected, TX rejected, no-hops-non-advert rejected). TestBuildClientReception covers happy-path + Direct rejection + lat out of range. Names describe behavior. Edge case I'd add: 32-byte-but-non-hex decoded.Payload.PubKey on an advert — currently deriveHeardKey would accept it because the OK path checks keyLen==32 not hex validity (strings.ToUpper(full) is hex-only by the upstream decoder, so this is probably defensive paranoia, not a real bug). Not a blocker.

cmd/server/hexgrid_test.go — stable+distinct cell, closed-ring boundary, malformed-cell-returns-nil. Minimal, would catch any breakage in the hex grid.

cmd/server/node_resolve_test.go — unique / ambiguous (1-byte collision) / not-found / bad-prefix-400. Asserts the resolver isn't fooled by short-prefix collisions, which is the actual subtle bug class for prefix lookup.

cmd/server/rx_coverage_endpoint_test.go — bbox filter (insert one row, query the wrong bbox, expect 0), the happy-path 200 + FeatureCollection decode, missing-bbox→400. Catches the realistic SQL-filter bug ("bbox params not applied").

cmd/server/rx_dashboard_test.go — leaderboard fallback chain (nodes.name → client_observers.name → empty), 7d window filter, observer-filter. Catches the most common dashboard bug (wrong join / wrong COUNT grouping).

test-coverage-gate.js — clever: extracts the actual nq-actions HTML expression from public/node-reach.js and evaluates it in a vm sandbox with MC_CLIENT_RX_COVERAGE true/false, asserts id="nqCoverage"/id="nqCovLegend" presence is gated correctly. This is not a hand-copied duplicate of the markup — it exercises the real source. If the source markup drifts, the test breaks. ✅

test-node-reach-coverage-e2e.js — Playwright real-browser test against the dev server: hits /api/config/client, skips cleanly when flag disabled (CI-safe default-off), otherwise asserts the FeatureCollection shape, the 400 on missing bbox, the #nqCoverage toggle issues a /rx-coverage XHR, the legend shows/hides, and the URL deep-link coverage=1 is set. Strong UX coverage.


Six Questions roll-up

| Q | Answer |
|---|--------|
| a. Test that fails on revert? | Yes — gate tests flip from 1↔0 rows; aggregation tests fail on wrong SNR pick; resolver tests fail on collision regression. |
| b. Smallest test for original bug? | N/A (net-new feature), but each behavior has a minimal isolated test. |
| c. Could wrong impl pass? | Looked for it. The aggregator tests and the gate tests both rule out "always return constant" and "always register route" wrong impls. |
| d. Edge cases NOT tested? | (1) decoded.Payload.PubKey hex-validity on advert path. (2) Multiple companions reporting the same heard_key from very-different locations — does aggregation cluster correctly across cells? Probably fine but not asserted. (3) pos_acc_m zero/negative — no validation in buildClientReception. Non-blocking. |
| e. Names? | All describe behavior (TestCoverageRoutesGatedOff, TestAggregateCoverageBucketsBestSNR, TestRxLeaderboard). ✅ |
| f. Setup heavier than test? | No. seedCoverageDB is shared once and small; per-test INSERTs are 1-3 lines. API is well shaped for testing. ✅ |

Must-fix: 0

Should-fix (non-blocking, file:line):

  • cmd/server/rx_dashboard_test.go:11-15: insRx uses fmt.Sprintf to build SQL with interpolated values. Inputs are test-controlled, so safe, but it's an anti-pattern that will be copy-pasted. Prefer parameterised mustExecDB(t, db, "...?,?...", args...).
  • Consider splitting each net-new layer into "add failing test" + "make it pass" next time so the red→green discipline is auditable on the branch even after CI retention drops the per-SHA runs.

Out of scope

  • Whether clientRxCoverage should default off forever (config decision, not test quality).
  • Whether the corescope-rx companion app should be vendored / documented further (covered by docs/client-rx-coverage.md).

Tests demonstrate behavior coverage, anti-tautology holds, six questions satisfactorily answered. Net-new-feature exemption applies. Approved on TDD + test quality.

@Kpa-clawbot

Owner

Munger Review (round 1) — schema & data model

"All I want to know is where I'm going to die, so I'll never go there." I went looking for where this table dies. Found several places.

The opt-in gate is genuinely clean: handleMessage short-circuits to handleClientPacket only when ClientRxCoverageEnabled(), and the call site is the only writer that touches client_receptions / client_observers. No side counters, no metrics fan-out — default-off truly means zero writes. That part is right.

The schema is additive (CREATE TABLE IF NOT EXISTS, CREATE INDEX IF NOT EXISTS) and re-runs idempotently inside applySchema. On a large prod DB it's safe — empty-table create + two single-column index creates on an empty table is microseconds. Good.

Where it dies:

Must-fix

  1. No retention / pruning for client_receptions — unbounded growth. cmd/ingestor/db.go:274 adds the table; grep -n "client_receptions" cmd/ingestor/db.go shows no PruneOld… counterpart, no RetentionConfig field, no cron entry. Compare PruneOldMetrics (db.go:1209), PruneDroppedPackets (db.go:1526) — every other write-heavy table in this codebase has a retention pruner wired to the retention loop. A mobile companion publishing at 1 Hz with multiple heard nodes/sec generates O(10⁴–10⁵) rows/day per client. Multiply by N clients, leave it for a year, the table is the largest in the DB and nothing reclaims space. Add PruneOldClientReceptions(days) and a RetentionConfig.ClientReceptionDays knob; wire into the existing retention goroutine; document the default in config.example.json.

  2. Missing time index — leaderboard and time-windowed coverage queries table-scan a growing table. cmd/server/rx_dashboard.go:1735 (rxLeaderboard) and rx_dashboard.go:1664 (queryCoverageFiltered with days>0) both do WHERE cr.rx_at >= ?. The two existing indexes are (heard_key) and (rx_pubkey) — neither supports a time scan. At 10⁶+ rows this is the query that pages out the cache. Add CREATE INDEX IF NOT EXISTS idx_client_recept_rx_at ON client_receptions(rx_at); and consider (rx_pubkey, rx_at) composite for the leaderboard GROUP BY.

  3. Bbox query has no spatial-or-time pruning index — full scan inside the bbox AND filter chain. queryCoverageRows and queryCoverageFiltered filter lat BETWEEN ? AND ? AND lon BETWEEN ? AND ? (rx_coverage.go:1240, rx_dashboard.go:1667). SQLite has no native R-tree on this table, and (heard_key)/(rx_pubkey) don't help bbox alone. For the per-node case heard_key filter narrows it; for the global /api/rx-coverage dashboard with no node filter, the bbox is the only selectivity and it does a full table scan every render. Either (a) add a coarse geohash column + index, or (b) require node or rx for the global endpoint, or (c) at minimum require days window + the index from (2). Pick one — don't ship the full-scan path.

  4. heard_keylen IN (2,3) branch defeats the heard_key index. WHERE … (heard_keylen = 32 AND heard_key = ?) OR (heard_keylen IN (2,3) AND substr(?, 1, heard_keylen*2) = heard_key) (rx_coverage.go:1240). The substr depends on the per-row heard_keylen, so the planner cannot drive the lookup from the index on heard_key. Splitting into a UNION ALL of three index-friendly queries — one for keylen=32 (full equality on heard_key), and one each for keylen=2 and keylen=3 with heard_key = substr(?,1,4) / substr(?,1,6) (constants) — lets the index actually work. Same fix applies in queryCoverageFiltered.

  5. /api/nodes/resolve does public_key LIKE ? — SQLite's LIKE is case-insensitive by default, so this does NOT use the PK index; it full-scans nodes on every request. node_resolve.go:35 and the duplicated resolveHeardKey at rx_dashboard.go:1611. Two ways to fix: (a) PRAGMA case_sensitive_like = ON; at startup, or (b) replace with a range scan: WHERE public_key >= ? AND public_key < ? with the bind being pfx and pfx+next-hex. Option (b) is local, doesn't change global semantics, and is what you want anyway since prefixes are validated lowercase hex. Also: this is hit per-coverage-tile-render (via heardKeyResolver cache, which is request-scoped not server-scoped), so even with a small nodes table the cost adds up.

  6. /api/nodes/resolve is an unrate-limited enumeration oracle over a full-scan query. With LIKE-on-text scanning all of nodes on every call (point 5), an unauthenticated attacker can issue 256 + 65536 ≈ 66K cheap-looking requests and (a) walk the entire node namespace + names, and (b) inflict ~66K full table scans on the read DB. Node names are already public on the dashboard, so the enumeration angle is minor — but the DoS amplification angle (LIKE-scan × 65K requests, with no rate limit and no LIMIT on the underlying table scan) is real. At minimum: fix the index (point 5) AND require prefix length ≥ 4 hex chars (2 bytes — matches the firmware's 2-byte path-hash minimum the ingestor already enforces). Combined, that turns 65K probes into 65K cheap range scans of bounded fanout.

  7. rx_at is TEXT NOT NULL with no format CHECK. The aggregation code at rx_coverage.go:1183 relies on rx_at being lexically-comparable RFC3339 ("rx_at is RFC3339, so lexical >= is chronological"). resolveRxTime is the trusted source — fine when it produces RFC3339. But the column accepts any string, and a future code path or a manual operator INSERT (or a test fixture that writes 't1', which already happens in rx_dashboard_test.go:1370) will silently produce wrong ordering — the "latest SNR" is then whatever happens to sort largest. Add CHECK(rx_at GLOB '[0-9][0-9][0-9][0-9]-*') or store an INTEGER unix timestamp column alongside (rx_at_unix INTEGER NOT NULL) and index THAT for point 2. The unix-integer choice is what the rest of this schema should have been doing all along.

  8. client_observers.name is whatever the client publishes — no length cap, no sanitization. client_reception.go:73 writes stringField(msg, "origin") directly. A misbehaving or malicious companion publishes a 1 MB name; it lands in the table, gets returned verbatim in the leaderboard JSON, and the frontend dutifully renders it. Cap at e.g. 64 bytes on write (if len(name) > 64 { name = name[:64] }) and add CHECK(length(name) <= 64) to enforce it at the DB level.

  9. id INTEGER PRIMARY KEY AUTOINCREMENT on a hot write table costs you per-insert. AUTOINCREMENT (vs plain INTEGER PRIMARY KEY which is the rowid alias) forces SQLite to maintain sqlite_sequence and guarantees monotonic-never-reused ids. Nothing in this PR needs that guarantee — there's no FK referencing client_receptions.id, no audit trail. Drop AUTOINCREMENT. Cheap win on the hottest write path.

  10. heard_keylen has no CHECK constraint. The whole query model assumes heard_keylen ∈ {2, 3, 32} (see the WHERE clauses in queryCoverageRows/queryCoverageFiltered). The ingest path in deriveHeardKey enforces >= 2, but a 4-byte path-hash (or 16, or anything else firmware introduces tomorrow) will be written and then silently invisible to queries — coverage rows that the ingestor accepted will never render. Add CHECK(heard_keylen IN (2, 3, 32)) OR teach the queries the general form. Pick one and write the test that fails when they diverge.

  11. No client_observers.last_seen retention; orphan rows accumulate. Companions that stopped publishing months ago still appear in JOIN results forever. Tie pruning to the client_receptions retention from (1): when a companion has no receptions in the retention window, drop its observer row.
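Option (b) from point 5 needs the half-open range [pfx, next) for a validated lowercase-hex prefix. One way to compute the upper bound is sketched below; prefixRange is a hypothetical helper, and the approach assumes the public_key column compares byte-wise (SQLite's default BINARY collation).

```go
package main

import "fmt"

// prefixRange returns bounds such that every lowercase-hex string that
// starts with pfx satisfies lo <= s < hi under byte-wise comparison.
// pfx is assumed to be non-empty, already validated lowercase hex.
// bounded is false when pfx consists only of 'f' characters: no finite
// upper bound exists, so the caller should use "public_key >= ?" alone.
func prefixRange(pfx string) (lo, hi string, bounded bool) {
	b := []byte(pfx)
	for i := len(b) - 1; i >= 0; i-- {
		c := b[i]
		if c == 'f' {
			continue // carry into the next digit to the left
		}
		if c == '9' {
			b[i] = 'a' // '9' + 1 would be ':', not a hex digit
		} else {
			b[i] = c + 1
		}
		return pfx, string(b[:i+1]), true
	}
	return pfx, "", false
}

func main() {
	for _, p := range []string{"ab", "a9", "9f", "ff"} {
		lo, hi, ok := prefixRange(p)
		fmt.Printf("%q -> lo=%q hi=%q bounded=%v\n", p, lo, hi, ok)
	}
}
```

The two bounds would then be bound into WHERE public_key >= ? AND public_key < ?, which the primary-key index can serve as a range scan.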

Out-of-scope (pre-existing — file separate issues)

  • The read-side server uses s.db.conn.Query directly in many of the new endpoints (rx_coverage.go, rx_dashboard.go, node_resolve.go). The existing code mostly does the same, so this isn't new — but there is no QueryRowContext/QueryContext-with-request-context anywhere, meaning a slow query can't be cancelled when the HTTP client disconnects. That's a project-wide pattern, not introduced here. Worth its own issue: "Plumb r.Context() through read-side DB queries so disconnects cancel work".

  • applySchema runs the entire CREATE TABLE block on every startup. Idempotent, but masks schema drift — a partial column add would silently never run. Out-of-scope here; existing pattern.


Verdict: NEEDS-WORK. The feature is well-isolated and the gate is correct. The data-model decisions, however, are written for the demo dataset, not for "this table grows by 10⁵ rows/day forever." Fix points 1–6 before merge; 7–11 are cheap and should land in the same pass.

Charlie

@Kpa-clawbot

Owner

DJB Review (round 1) — input parsing & opt-in gate

Threat model: this PR exposes three new attack surfaces — (1) a JSON ingest path tied to a public MQTT topic, (2) a per-prefix node-resolution endpoint, and (3) anonymous GeoJSON endpoints that materialize opted-in users' positions. The opt-in gate is well-placed (single chokepoint, fails closed), but several inputs cross trust boundaries without being parsed-into-a-validated-type.

The gate itself (requireClientRxCoverage 404 + ClientRxCoverageEnabled nil-safe default + per-handler check) is correct and the gate tests are good. Route registration order is correct (specific paths before catch-all). The hex grid math doesn't blow up on bad input by itself — but aggregateCoverage is downstream of lat/lon validation that has a gap. Findings, in priority order:

Must-fix

  1. buildClientReception does not reject non-finite coordinates explicitly, and NaN passes the range check (cmd/ingestor/client_reception.go:174). The bounds check lat < -90 || lat > 90 || lon < -180 || lon > 180 evaluates to false for NaN, because every ordered comparison against NaN is false in IEEE-754 and therefore in Go; ±Inf happens to be caught by the same comparisons, but nothing rejects it by design. toFloat64 accepts JSON string fields and runs them through strconv.ParseFloat, which parses "NaN", "+Inf", "-Inf" successfully. Downstream hexMercator → hexCellAt produces undefined int(math.Round(NaN)) (architecture-dependent, typically -2^63), funneling every poisoned point into one or two arbitrary cells that are then emitted publicly. Add an explicit math.IsNaN(lat) || math.IsNaN(lon) || math.IsInf(lat,0) || math.IsInf(lon,0) reject before the range check, and the same for accPtr.

  2. /api/nodes/resolve enables 256-probe enumeration of every uniquely-prefixed node (cmd/server/node_resolve.go:18,22). hexPrefixRe = ^[0-9a-f]{2,64}$ accepts 2-char (1-byte) prefixes — below the project's own collision-prone threshold that deriveHeardKey enforces at keylen < 2 (i.e., 4 hex / 2 bytes minimum). With 256 two-char probes an unauthenticated caller harvests (pubkey, name) for every node whose 1-byte prefix is unique (typical for small/medium meshes — most of them). No rate limit, no auth. Fix: require minimum 4 hex chars ({4,64}) to match the rest of the codebase, and consider an even-length constraint (^([0-9a-f]{2})+$ with {4,64} length) so odd-length prefixes can't get through.

  3. rxPubkey from MQTT topic is not validated as hex before DB insert (cmd/ingestor/client_reception.go:67,90). parts[2] is taken verbatim, lowercased, and written to client_receptions.rx_pubkey and client_observers.pubkey. If the broker ACL is misconfigured (the doc only says ACL is "recommended"), or the publisher uses a non-pubkey topic segment (legal MQTT), arbitrary strings end up as identity rows that then join into /api/rx-leaderboard output. Reject the message unless rxPubkey matches the same hex regex used elsewhere (^[0-9a-f]{64}$ for full pubkey).

  4. No length caps on attacker-controlled strings before DB writes (cmd/ingestor/client_reception.go:80-84). origin (companion's self-reported name) is taken straight from the JSON map and UpsertClientObserver'd. A hostile publisher can submit a 1MB origin string; SQLite will store it; the leaderboard endpoint will return it. Same applies to origin_id and direction. Cap each string field (≤64 bytes is plenty for a name) before any write.

  5. /api/rx-coverage allows whole-database scrape via wide bbox + max days (cmd/server/rx_dashboard.go:122-138). parseBBox accepts any (-90,-180,90,180); clampDays caps at 30; queryCoverageFiltered then issues an unbounded SELECT ... FROM client_receptions WHERE lat BETWEEN ... AND ... with no LIMIT. An anonymous caller can dump every opted-in user's reception history (lat/lon/rx_at/rssi/snr → trivially reconstructable movement traces) in one request. The hex aggregation runs in memory after the scan, so the DoS cost is the operator's, not the attacker's. Fix: enforce a max bbox area (e.g., reject if (maxLat-minLat) * (maxLon-minLon) > N) AND add a hard row cap on the SQL with a "result truncated" flag, AND consider serving low-zoom requests from a pre-aggregated cell table instead of raw rows.

  6. z=18 hex cells expose ~7m bins → single-contributor cells are raw GPS (cmd/server/hexgrid.go:23,41). hexTargetPx=28 at zoom 18 produces hex cells on the order of single-digit meters; any cell with a single contributor (which is the common case when a user is the only opt-in in their neighborhood) is effectively a published GPS fix at a specific time. K-anonymity is the standard mitigation here (suppress cells with Count < k, e.g., k=3, and refuse to render fine resolutions when contributing-client count is below threshold). The opt-in nominally consents to coverage publication; it does not consent to publishing one user's home location.

  7. MQTT broker ACL is documented as "recommended", but the trust model depends on it being mandatory (docs/client-rx-coverage.md:48-50). Without per-client ACLs binding parts[2] to the publisher's own pubkey, any MQTT client with publish rights to meshcore/client/+/packets can forge receptions attributed to any pubkey, salt the leaderboard, and plant false coverage anywhere on Earth. Either (a) document this as a hard requirement and add a startup check / operator banner when clientRxCoverage.enabled=true but no broker auth/ACL is detectable, or (b) require an in-payload signature over (rx_pubkey, rx_at, gps, raw) that the ingestor verifies against the companion's advertised pubkey. Option (b) is the only one that gives defense-in-depth ("don't rely on a single check"); option (a) at least surfaces the risk operators are taking on.

  8. No range validation on RSSI / SNR before DB insert (cmd/ingestor/client_reception.go:54-58). int(f) on a JSON-supplied float happily truncates 1e20 into an undefined int (Go spec leaves out-of-range float→int conversions implementation-defined). Realistic RSSI is [-200, 0] dBm; SNR is [-40, +40] dB. Reject anything outside those (or clamp). Without this an attacker can pollute aggregations (bestSNR becomes +1e9) and poison sort order in sortedCoverageNodes.

  9. No retention prune for client_receptions / no heard_keylen upper bound (cmd/ingestor/client_reception.go:118-130, 200-211). heard_keylen = len(last)/2 is stored with no upper bound; a junk path of arbitrary length is accepted (the SELECT filters by heard_keylen IN (2,3,32) so the row is dead on read, but it sits on disk forever). Combined with the ACL caveat in (7), this is unbounded storage growth on any deployment with the feature enabled. Add retention (retention.clientReceptionDays, default e.g. 30) following the same pattern as RetentionConfig.PacketDays, and reject rows where heard_keylen isn't in the set used by the query path.
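Items 1, 3, 4 and 8 all amount to parsing untrusted input into a validated type before it reaches the database. A hedged sketch of what that could look like is below. The Reception type, newReception and truncateUTF8 are hypothetical names; the numeric ranges are the ones suggested in item 8, and the 64-byte cap is the one suggested in item 4.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"regexp"
	"unicode/utf8"
)

var fullPubkeyRe = regexp.MustCompile(`^[0-9a-f]{64}$`)

const maxNameBytes = 64

// Reception holds only values that passed validation.
type Reception struct {
	RxPubkey  string
	Name      string
	Lat, Lon  float64
	SNR, RSSI float64
}

func finite(v float64) bool { return !math.IsNaN(v) && !math.IsInf(v, 0) }

// truncateUTF8 caps s at max bytes without splitting a multi-byte rune.
func truncateUTF8(s string, max int) string {
	if len(s) <= max {
		return s
	}
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

// newReception rejects malformed identity and non-finite or out-of-range
// numbers, and caps the self-reported name.
func newReception(rxPubkey, name string, lat, lon, snr, rssi float64) (Reception, error) {
	switch {
	case !fullPubkeyRe.MatchString(rxPubkey):
		return Reception{}, errors.New("rx pubkey is not 64 lowercase hex chars")
	case !finite(lat) || !finite(lon):
		return Reception{}, errors.New("non-finite coordinate")
	case lat < -90 || lat > 90 || lon < -180 || lon > 180:
		return Reception{}, errors.New("coordinate out of range")
	case !finite(snr) || snr < -40 || snr > 40:
		return Reception{}, errors.New("snr out of range")
	case !finite(rssi) || rssi < -200 || rssi > 0:
		return Reception{}, errors.New("rssi out of range")
	}
	return Reception{
		RxPubkey: rxPubkey, Name: truncateUTF8(name, maxNameBytes),
		Lat: lat, Lon: lon, SNR: snr, RSSI: rssi,
	}, nil
}

func main() {
	_, err := newReception("not-a-key", "x", math.NaN(), 0, 0, -90)
	fmt.Println(err)
}
```

Checking finiteness before the range comparison matters because the comparison alone lets NaN through.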

Out-of-scope (pre-existing, not introduced by this PR)

  • The MQTT JSON parser uses json.Unmarshal with no input-size cap (general ingest path). Worth a separate hardening pass on handleMessage.
  • The toFloat64 helper accepts numbers in string form including unit suffixes ("−92dBm"), which is generous-by-design for legacy producers but increases parser attack surface globally. Out of scope here.
  • clampDays(1..30) matches reach-page conventions; not new.

Quiet approval

  • The opt-in gate is structurally correct: nil-safe ClientRxCoverageEnabled, single chokepoint, returns clean 404 (not SPA fallback). Tests cover both directions of the gate. The direction != "rx" + FLOOD-only-with-path + 2-byte-min keylen attribution rules in deriveHeardKey are conservative and well-justified against the firmware citations in the doc. nodes.public_key LIKE ? is safe because the input is regex-validated hex before the query (parse-don't-validate done right) — modulo the prefix-length issue in #2.

Verdict: comment-only, do not merge until #1–#3 and #5 are addressed. Items #4, #6, #7, #8, #9 are must-fix for the feature to be safe to enable on a public deployment, but can be argued as follow-ups if the feature ships off by default (it does) and the docs flag the open issues loudly.
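The k-anonymity mitigation named in item 6 can be sketched as a filter on the aggregation step. Everything here is illustrative: the Reception and CellSummary types and aggregateK are hypothetical, and the sketch counts distinct contributing companions per cell, which is one reading of the threshold the item describes.

```go
package main

import (
	"fmt"
	"sort"
)

// Reception is a hypothetical row already assigned to a hex cell.
type Reception struct {
	Cell     string
	RxPubkey string
	SNR      float64
}

// CellSummary is what would be published for one cell.
type CellSummary struct {
	Cell         string
	Count        int
	Contributors int
	BestSNR      float64
}

// aggregateK buckets receptions per cell, keeps the best (max) SNR, and
// suppresses every cell with fewer than k distinct contributors.
func aggregateK(rows []Reception, k int) []CellSummary {
	type acc struct {
		count int
		best  float64
		who   map[string]struct{}
	}
	cells := map[string]*acc{}
	for _, r := range rows {
		a := cells[r.Cell]
		if a == nil {
			a = &acc{best: r.SNR, who: map[string]struct{}{}}
			cells[r.Cell] = a
		}
		a.count++
		if r.SNR > a.best {
			a.best = r.SNR
		}
		a.who[r.RxPubkey] = struct{}{}
	}
	out := make([]CellSummary, 0, len(cells))
	for id, a := range cells {
		if len(a.who) < k {
			continue // too few contributors: publishing would expose one user
		}
		out = append(out, CellSummary{Cell: id, Count: a.count, Contributors: len(a.who), BestSNR: a.best})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Cell < out[j].Cell })
	return out
}

func main() {
	rows := []Reception{
		{"a", "p1", -12}, {"a", "p2", -5}, {"a", "p3", -9},
		{"b", "p1", -3}, {"b", "p1", -4},
	}
	fmt.Println(aggregateK(rows, 3))
}
```

Counting distinct companions rather than rows means one user reporting many times from home does not lift their own cell above the threshold.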

@Kpa-clawbot

Owner

Independent review (round 1)

Cold-read adversarial review. Diff matches PR title scope (no scope creep). The cmd/server/types.go "+100/-99" is gofmt-realignment from a single new ClientRxCoverage field; benign noise. Big surface, well-tested at the unit level. Below are must-fix items grouped by area.

Verdict: NEEDS-WORK (no hard blockers; multiple correctness, perf and discipline items).

Must-fix

  1. Gate vs blacklist ordering — bypass. cmd/ingestor/main.go:538-543 routes meshcore/client/<pk>/packets to handleClientPacket BEFORE cfg.IsObserverBlacklisted(parts[2]) runs (which is below at ~553). A blacklisted operator who publishes coverage on the client topic skirts the blacklist entirely. Either gate handleClientPacket on IsObserverBlacklisted(parts[2]) internally or reorder.
  2. rxPubkey from topic is not validated. handleClientPacket (cmd/ingestor/client_reception.go:16) trusts parts[2] blindly and writes it as the PK of client_receptions/client_observers. With a misconfigured/no-ACL broker, a publisher on meshcore/client/../packets or meshcore/client/!@#$/packets pollutes both tables. Reject anything not matching ^[0-9a-f]{2,64}$ (reuse hexPrefixRe analogue), and lowercase before compare.
  3. Dead production code: mobileRxStats (cmd/server/rx_coverage.go:187). Defined, exported only via test (rx_coverage_endpoint_test.go:45), never called from any handler. Either wire it into the per-node response or delete it.
  4. requireClientRxCoverage panics on nil cfg. cmd/server/rx_dashboard.go calls s.cfg.ClientRxCoverageEnabled() without guarding s.cfg == nil. Routes are registered unconditionally, so any call path that constructs Server without cfg crashes the request. Add if s == nil || s.cfg == nil { http.NotFound; return false }.
  5. No useful index for the dominant query. client_receptions queries filter lat BETWEEN ? AND ? AND lon BETWEEN ? AND ? AND (heard_keylen=32 AND heard_key=? OR heard_keylen IN (2,3) AND substr(?, 1, heard_keylen*2) = heard_key). The two added indexes (heard_key, rx_pubkey) don't help the bbox + prefix path (the substr predicate is non-sargable and there is no lat/lon index). Add a composite (heard_key, heard_keylen, lat, lon) or at minimum idx_client_recept_latlon. As the table grows this becomes a full scan per coverage request.
  6. Coverage redraw has zero debounce. public/node-reach-coverage.js:35 and public/rx-coverage.js:99 both bind moveend zoomend and fire /api/rx-coverage immediately. A user pan storm = N requests/sec. Add a small (≈200ms) trailing debounce.
  7. All client fetch errors are silently swallowed. .catch(function(){}) in node-reach-coverage.js and rx-coverage.js (drawCoverage, fitToObserver, loadBoard). The user sees an empty hex layer with no signal whether the server 5xx'd, the bbox was rejected, or the response was empty. At minimum console.warn, ideally a one-line in-page status.
  8. Non-deterministic GeoJSON feature order. aggregateCoverage iterates byCell (Go map) and appends features in random order. Output is non-deterministic across requests, defeats client-side caching/ETag and makes any "first feature" e2e assertion flaky. Sort fc.Features by cell before return.
  9. TestHandleClientPacketAdvertWritesReception is mis-named / tautological at its premise. The fixture is a relayed advert (non-empty path), so the test exercises the rxlog last-hop branch — the comment admits this. There is no end-to-end test that handleClientPacket actually emits a src='advert' row for the 0-hop case (deriveHeardKey is covered alone, but the path through gps/snr/storage isn't). Add a true 0-hop advert fixture or rename.
  10. Dead fallback in handleClientPacket. In firstNonEmpty(rxPubkey, stringField(msg, "origin_id")), rxPubkey always comes from the topic and is checked for emptiness later. The origin_id fallback is unreachable. Either drop it or (paired with #2) use it only when the topic value fails hex validation.
  11. Per-cell Nodes slice is uncapped on the wire. Client truncates to COVERAGE_NODE_CAP=10 (rx-coverage.js:53), but aggregateCoverage ships every node per cell. Popular cells over 30-day windows can be huge. Cap server-side too (e.g. top 25 by latest SNR / count) and add a truncated flag.
  12. No response-size cap on /api/rx-coverage. Wide bbox + high zoom over the leaderboard window can produce multi-MB GeoJSON. Either cap the number of features or document a hard upper bound; current code will happily stream however much the DB returns.
  13. MC_CLIENT_RX_COVERAGE race with first page load. public/roles.js:534 sets the flag inside the MeshConfigReady promise. If the user lands directly on #/nodes/<pk>/reach (or #/rx-coverage) before that promise resolves, node-reach.js's load() reads window.MC_CLIENT_RX_COVERAGE === true as false and the coverage toggle is never injected. Either await MeshConfigReady in page init, or re-render on a mc:config-ready event.
  14. data-rx="..." attribute is built from raw DB strings. rx-coverage.js:75 interpolates o.pubkey straight into HTML; only o.name is escapeHtml'd. Pubkey is hex by convention today, but there is no DB CHECK constraint on client_observers.pubkey or client_receptions.rx_pubkey, and per #2 unvalidated topic strings are inserted. That's an HTML injection waiting to happen. Escape both.
  15. /api/nodes/resolve is always-on and enumerable. It accepts 2-char hex prefixes and returns {name, pubkey, ambiguous}. With 256 prefixes you can enumerate every known node. The opt-in gate intentionally excludes it (documented), but at minimum: rate-limit, require ≥3 hex chars, or only return {ambiguous} for prefixes shorter than 6.
  16. config.example.json lacks the same field in the ingestor section. Operators reading the example will see clientRxCoverage once and assume one flag; the server and ingestor each read their own config and each has its own struct. Confirm the example is interpreted by both, or document where else the flag needs to live.
  17. Hex grid degenerates at high latitudes. hexInvMercator is undefined past ~85.05°. Coverage submitted from polar regions will produce NaN rings. Clamp lat in hexCellAt and document the supported range.
  18. heard_keylen index missing. Queries gate on heard_keylen IN (2,3) and =32, with no index covering it. Combined with #5, this is the second part of the same scan problem.
  19. nqCovLegend mixes inline style="display:..." with CSS class control. node-reach.js:194-200 writes inline style and toggles it in applyCoverage; CSS overrides won't win. Use a class (.is-hidden) instead.
  20. aggregateCoverage node.name may flap. Inside the loop, if name != "" { na.name = name } overwrites with the latest resolver result. When resolver returns ("", "") for an ambiguous prefix and later (pk, "Alice") for the full pubkey, the order depends on row order. Tested only with a constant resolver — add a test for the mixed case so the precedence is locked.

Out-of-scope (don't block; file follow-ups)

  1. The whole "/api/nodes/resolve always-on" privacy stance probably deserves a dedicated enumeration-rate-limiter, not a per-PR fix.
  2. CGO-free hex grid is a clean MVP, but a future move to H3 (or an opt-in build tag) would be worth tracking.
  3. No metric / /api/stats field for "coverage receptions in last hour" — would help operators see the feature is alive.
  4. client_receptions has no retention sweep wired into existing retention config — will grow unbounded.
  5. EMQX ACL trust model is documented but not asserted/tested anywhere on the server side.

Re-spawn with <!-- mc-bot-reviewed:v2 --> after addressing.

Erwin Fiten and others added 15 commits June 16, 2026 09:24
…lawbot#10)

The companion identity is the topic segment in meshcore/client/<pk>/packets,
which the broker is expected to ACL-bind to the publisher. On a broker without
ACLs an attacker could publish under an arbitrary topic (e.g. !@#$) and pollute
client_receptions / client_observers with junk pubkeys.

Reject any topic pubkey that is not lowercase hex (^[0-9a-f]{2,64}$, mirroring
the server-side hexPrefixRe) before any write. Because the topic value is now
always validated, the firstNonEmpty(rxPubkey, origin_id) payload fallback is
both unreachable (Kpa-clawbot#10) and a trust hole, so it is removed; the companion
identity comes only from the ACL-bound topic.

Test: TestHandleClientPacketRejectsNonHexPubkey drives non-hex topic segments
and asserts zero rows in both tables (fails without the guard). Existing
fixtures updated to use a valid hex companion pubkey.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
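For readers following the commit above, the topic check can be sketched roughly as below. This is a minimal illustration, not the PR's code: companionFromTopic is a hypothetical helper name, and only the regular expression and the topic shape come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// clientPubkeyRe is the pattern quoted in the commit message: 2 to 64
// lowercase hex characters and nothing else.
var clientPubkeyRe = regexp.MustCompile(`^[0-9a-f]{2,64}$`)

// companionFromTopic is a hypothetical helper (not taken from the PR). It
// returns the companion pubkey from meshcore/client/<pk>/packets, or false
// when the topic shape or the pubkey segment is invalid.
func companionFromTopic(topic string) (string, bool) {
	parts := strings.Split(topic, "/")
	if len(parts) != 4 || parts[0] != "meshcore" || parts[1] != "client" || parts[3] != "packets" {
		return "", false
	}
	if !clientPubkeyRe.MatchString(parts[2]) {
		return "", false // reject before any write happens
	}
	return parts[2], true
}

func main() {
	for _, topic := range []string{
		"meshcore/client/a1b2c3d4/packets",
		"meshcore/client/!@#$/packets",
		"meshcore/client/../packets",
		"meshcore/client/A1B2/packets",
	} {
		pk, ok := companionFromTopic(topic)
		fmt.Printf("%q ok=%v pk=%q\n", topic, ok, pk)
	}
}
```

Uppercase hex is rejected rather than normalised in this sketch, which matches the commit's "lowercase hex" wording; whether the real handler lowercases first is not stated.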
handleMessage dispatched the meshcore/client/<pk>/packets topic to
handleClientPacket and returned before the IsObserverBlacklisted check that
guards the observer path. A blacklisted operator could therefore keep feeding
coverage data through the client topic.

Check the blacklist for the companion pubkey at the top of the client dispatch
branch and drop (with a log) before any write.

Test: TestClientRxCoverageBlacklistedDropped drives handleMessage with the
feature ON and the companion pubkey blacklisted, asserting zero rows; it fails
without the gate because the old order inserted before the blacklist ran.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
requireClientRxCoverage dereferenced s.cfg directly while the coverage routes
are registered unconditionally, so a nil server/cfg would panic instead of
returning a clean 404. Guard s == nil / s.cfg == nil and make the
(*Config).ClientRxCoverageEnabled() receiver nil-safe too.

Test: TestRequireClientRxCoverageNilSafe drives handleRxCoverage with a nil cfg
and asserts 404 (panics without the guard), plus the nil-receiver helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
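A sketch of the nil-safe gate described in this commit, under the assumption that Config and Server look roughly like the pared-down stand-ins below (the real structs have more fields; statusFor is a hypothetical test helper):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ClientRxCoverage, Config and Server are simplified stand-ins for the real types.
type ClientRxCoverage struct{ Enabled bool }

type Config struct{ ClientRxCoverage *ClientRxCoverage }

// ClientRxCoverageEnabled is safe to call on a nil *Config.
func (c *Config) ClientRxCoverageEnabled() bool {
	return c != nil && c.ClientRxCoverage != nil && c.ClientRxCoverage.Enabled
}

type Server struct{ cfg *Config }

// requireClientRxCoverage writes a 404 and returns false unless the feature
// is enabled; it tolerates a nil server and a nil config.
func (s *Server) requireClientRxCoverage(w http.ResponseWriter, r *http.Request) bool {
	if s == nil || s.cfg == nil || !s.cfg.ClientRxCoverageEnabled() {
		http.NotFound(w, r)
		return false
	}
	return true
}

func (s *Server) handleRxCoverage(w http.ResponseWriter, r *http.Request) {
	if !s.requireClientRxCoverage(w, r) {
		return
	}
	w.Header().Set("Content-Type", "application/json")
	fmt.Fprint(w, `{"type":"FeatureCollection","features":[]}`)
}

// statusFor runs one request through the handler and returns the status code.
func statusFor(s *Server) int {
	rec := httptest.NewRecorder()
	s.handleRxCoverage(rec, httptest.NewRequest(http.MethodGet, "/api/rx-coverage", nil))
	return rec.Code
}

func main() {
	fmt.Println("nil cfg:", statusFor(&Server{}))
	fmt.Println("enabled:", statusFor(&Server{cfg: &Config{ClientRxCoverage: &ClientRxCoverage{Enabled: true}}}))
}
```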
Kpa-clawbot#15)

The prefix resolver accepted 2-char hex prefixes, so the 256 one-byte prefixes
enumerated every node name; and, unlike /api/nodes/search and /api/resolve-hops,
it returned blacklisted / hidden-prefix (Kpa-clawbot#1181) node identities the rest of the
API hides.

- Require >= 4 hex chars. 1-byte (2 hex) keys are never stored (the ingestor
  rejects heard keys shorter than 2 bytes), so the floor matches the data model
  while ruling out trivial full-table enumeration.
- A unique match that is blacklisted or hidden now resolves as not-found.
- Apply the same identity-hiding to resolveHeardKey so coverage tooltips don't
  leak hidden node names either.

The endpoint is kept (single-prefix -> name lookup, distinct from the q= fuzzy
search and the hop-context resolver) and stays in openapi_known_gaps.json.
Per-IP rate limiting is left to follow-up #1.

Tests: TestResolvePrefix gains <4-hex 400 cases and an aabb-collision ambiguous
case; TestResolvePrefixHidesBlacklistedAndHidden asserts blacklisted/hidden
matches resolve as not-found (both fail without the fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
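The resolver rules in this commit (4-hex floor, ambiguity, hidden matches reported as not-found) can be illustrated with an in-memory sketch. Everything here is hypothetical except the rules themselves: the node and resolveResult types, the Hidden flag standing in for "blacklisted or hidden-prefix", lowercasing the input, and counting hidden nodes towards ambiguity are all assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type node struct {
	Pubkey, Name string
	Hidden       bool // stands in for blacklisted or hidden-prefix nodes
}

type resolveResult struct {
	Name, Pubkey string
	Ambiguous    bool
}

// hexPrefixRe enforces the 4-hex-character floor described in the commit.
var hexPrefixRe = regexp.MustCompile(`^[0-9a-f]{4,64}$`)

// resolvePrefix returns ok=false for malformed or too-short prefixes (the
// caller would answer 400). A unique hidden match looks like "not found".
func resolvePrefix(nodes []node, prefix string) (res resolveResult, ok bool) {
	prefix = strings.ToLower(prefix)
	if !hexPrefixRe.MatchString(prefix) {
		return resolveResult{}, false
	}
	var matches []node
	for _, n := range nodes {
		if strings.HasPrefix(n.Pubkey, prefix) {
			matches = append(matches, n)
		}
	}
	switch {
	case len(matches) > 1:
		return resolveResult{Ambiguous: true}, true
	case len(matches) == 1 && !matches[0].Hidden:
		return resolveResult{Name: matches[0].Name, Pubkey: matches[0].Pubkey}, true
	}
	return resolveResult{}, true // no match, or a unique hidden match
}

func main() {
	nodes := []node{
		{Pubkey: "aabb1100", Name: "Alice"},
		{Pubkey: "aabb2200", Name: "Bob"},
		{Pubkey: "ccdd3300", Name: "Carol", Hidden: true},
	}
	for _, p := range []string{"aa", "aabb", "aabb11", "ccdd"} {
		res, ok := resolvePrefix(nodes, p)
		fmt.Printf("%-7s ok=%v %+v\n", p, ok, res)
	}
}
```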
…lawbot#14)

renderBoard built each leaderboard row by string concatenation and interpolated
o.pubkey raw into data-rx="..." (and into the truncated fallback label when the
observer has no name) while only o.name was escaped. A non-hex pubkey (possible
on a no-ACL broker, or in rows ingested before the #2 validation) could break
out of the attribute and inject markup.

escapeHtml() both the data-rx value and the truncated-pubkey label.

Test: test-rx-coverage-escape.js slices the real row-builder out of
rx-coverage.js and renders it with a markup-bearing pubkey, asserting no raw
tag survives (fails when the escaping is reverted).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…a-clawbot#8, Kpa-clawbot#20)

aggregateCoverage built its GeoJSON features by ranging a Go map, so feature
order was randomized — defeating ETag/caching and making "first feature" e2e
checks flaky (Kpa-clawbot#8). And the per-node name was set with `if name != "" { name =
... }`, so when several heard_keys mapped to the same node the displayed name
depended on row/map order (Kpa-clawbot#20).

- Sort fc.Features by cell before returning.
- Lock the node identity (name and display-prefix fallback) to the most
  specific (longest) heard_key that resolved, tie-broken lexicographically, so a
  full-pubkey reception outranks a short prefix independent of order.

Tests: TestAggregateCoverageDeterministicFeatureOrder asserts features come out
sorted by cell; TestAggregateCoverageNamePrecedenceOrderIndependent feeds the
same rows in both orders with a resolver that returns different names per
heard_key and asserts the name is stable (fails under the old last-writer rule).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
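The two rules in this commit (sort features by cell; let the most specific heard key decide the name) might look like the sketch below. The feature type and both helper names are hypothetical, and the tie-break direction (lexicographically smaller key wins) is an assumption, since the commit only says "tie-broken lexicographically".

```go
package main

import (
	"fmt"
	"sort"
)

type feature struct {
	Cell  string
	Count int
}

// sortedFeatures flattens a cell->count map into a slice in cell order, so
// the output no longer depends on Go's unspecified map iteration order.
func sortedFeatures(byCell map[string]int) []feature {
	out := make([]feature, 0, len(byCell))
	for cell, n := range byCell {
		out = append(out, feature{Cell: cell, Count: n})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Cell < out[j].Cell })
	return out
}

// moreSpecific reports whether heard key a outranks b: the longer key wins,
// and equal lengths are tie-broken lexicographically (assumed: smaller wins).
func moreSpecific(a, b string) bool {
	if len(a) != len(b) {
		return len(a) > len(b)
	}
	return a < b
}

// pickName returns the name resolved from the most specific heard key,
// skipping keys whose resolution came back empty.
func pickName(nameByKey map[string]string) string {
	bestKey, bestName := "", ""
	for key, name := range nameByKey {
		if name == "" {
			continue
		}
		if bestName == "" || moreSpecific(key, bestKey) {
			bestKey, bestName = key, name
		}
	}
	return bestName
}

func main() {
	fmt.Println(sortedFeatures(map[string]int{"c3": 1, "a1": 2, "b2": 3}))
	fmt.Println(pickName(map[string]string{"a1b2": "Short", "a1b2c3d4e5f6": "Alice"}))
}
```

Because moreSpecific is a strict total order over distinct keys, the winner is the same whatever order the rows or map entries arrive in.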
…bot#12)

A wide bbox at high zoom over the 30-day window could return unbounded GeoJSON:
every hex cell, each with every node heard there.

- Cap the per-cell node breakdown at coverageCellNodeCap (25); set
  properties.nodes_truncated when more nodes were heard than returned (Kpa-clawbot#11). The
  client only renders ~10, so this just bounds the wire payload.
- Cap the feature collection at coverageFeatureCap (5000) cells, keeping the
  densest (count desc, cell asc tie-break for determinism) and setting the
  top-level truncated flag (Kpa-clawbot#12). Both flags are omitempty so untruncated
  responses are unchanged. truncated/nodes_truncated are GeoJSON foreign members
  that Leaflet ignores.

Tests: TestAggregateCoverageCapsNodesPerCell (30 nodes -> 25 + flag) and
TestAggregateCoverageCapsFeatures (5625 cells -> 5000 + flag, still cell-sorted,
small query untruncated).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
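A rough sketch of the two caps follows. The constant names and values come from the commit message; the cell type and the capNodes / capFeatures helpers are hypothetical, and the node slice is assumed to arrive already ordered by preference.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	coverageCellNodeCap = 25   // per-cell node breakdown cap (from the commit)
	coverageFeatureCap  = 5000 // feature collection cap (from the commit)
)

type cell struct {
	ID             string
	Count          int
	Nodes          []string
	NodesTruncated bool
}

// capNodes trims the per-cell node list and flags the truncation.
func capNodes(c cell, limit int) cell {
	if len(c.Nodes) > limit {
		c.Nodes = c.Nodes[:limit]
		c.NodesTruncated = true
	}
	return c
}

// capFeatures keeps the densest cells (count desc, ID asc on ties), then
// restores cell order so the response stays deterministic.
func capFeatures(cells []cell, limit int) (kept []cell, truncated bool) {
	kept = append([]cell(nil), cells...) // work on a copy
	if len(kept) > limit {
		sort.Slice(kept, func(i, j int) bool {
			if kept[i].Count != kept[j].Count {
				return kept[i].Count > kept[j].Count
			}
			return kept[i].ID < kept[j].ID
		})
		kept = kept[:limit]
		truncated = true
	}
	sort.Slice(kept, func(i, j int) bool { return kept[i].ID < kept[j].ID })
	return kept, truncated
}

func main() {
	cells := []cell{{ID: "a", Count: 1}, {ID: "b", Count: 5}, {ID: "c", Count: 3}}
	kept, truncated := capFeatures(cells, 2)
	fmt.Println(kept, truncated)
	big := capNodes(cell{ID: "x", Nodes: make([]string, 30)}, coverageCellNodeCap)
	fmt.Println(len(big.Nodes), big.NodesTruncated)
}
```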
mobileRxStats was defined and tested but never called by a handler — dead
production code. Rather than drop it, surface the node-wide totals it computes:
the per-node coverage endpoint now returns mobile_receptions and mobile_clients
(distinct contributing companions) as foreign members on the FeatureCollection,
so the UI can show "heard by N clients" independent of the current bbox/pan.
Both are omitempty, so the global /api/rx-coverage payload is unchanged.

Test: TestRxCoverageEndpointGeoJSON now asserts the wired-in totals (1/1 for the
single seeded reception); fails if the stats aren't attached.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bot#17)

hexMercator diverges toward the poles (tan(π/4 + lat·π/360) → ∞), so a coverage
submission past ~85.05° produced NaN hex rings via hexInvMercator. Clamp lat to
±hexMaxLat (85.05112878) in hexCellAt and document that coverage is defined only
within that range.

Test: TestHexCellAtClampsPolarLatitude drives ±89.9/±90° and asserts they bin to
the clamped edge cell with a finite (non-NaN/Inf) boundary ring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
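The clamp can be illustrated with the standard spherical Mercator formula quoted in the commit. This is a sketch, not the PR's hexMercator: clampLat and mercatorY are hypothetical names, and only the hexMaxLat value is taken from the commit.

```go
package main

import (
	"fmt"
	"math"
)

// hexMaxLat is the Web Mercator latitude limit quoted in the commit.
const hexMaxLat = 85.05112878

// clampLat restricts a latitude in degrees to ±hexMaxLat.
func clampLat(lat float64) float64 {
	return math.Max(-hexMaxLat, math.Min(hexMaxLat, lat))
}

// mercatorY is the unit-sphere Mercator projection of a latitude in degrees:
// y = ln(tan(π/4 + lat·π/360)).
func mercatorY(latDeg float64) float64 {
	return math.Log(math.Tan(math.Pi/4 + latDeg*math.Pi/360))
}

func main() {
	for _, lat := range []float64{0, 60, 89.9, 90, -90} {
		c := clampLat(lat)
		fmt.Printf("lat=%v clamped=%v y=%.6f\n", lat, c, mercatorY(c))
	}
}
```

At the clamp edge the projected y is approximately ±π, which is what makes the Web Mercator world square; past it the projection grows without bound, so clamping before binning keeps every cell finite.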
TestHandleClientPacketAdvertWritesReception used a RELAYED advert (non-empty
path), so it exercised the rxlog last-hop branch — not the 0-hop src='advert'
path its name implied. Rename it to ...RelayedAdvert... and add
TestHandleClientPacketZeroHopAdvertWritesReception, which rebuilds the same
advert with zero hops (header + "00" + payload) and asserts the advertiser is
stored by its full pubkey with src='advert', plus gps/snr capture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…a-clawbot#18)

The per-node coverage query filters by lat/lon bbox AND matches the heard node
by full key or 2-3 byte prefix; the only indexes were single-column heard_key /
rx_pubkey, so the bbox path (the sargable common filter) fell back to a full
table scan.

Add a composite (heard_key, heard_keylen, lat, lon) — which serves the
heard_key-equality seek, carries lat/lon for the range, and supersedes the old
single-column heard_key index — plus idx_client_recept_latlon so the planner can
drive from a selective bbox (Kpa-clawbot#18 covers heard_keylen). CREATE INDEX IF NOT
EXISTS in the base schema covers fresh and existing DBs (the table is new in
this PR).

Test: TestClientReceptionsCoverageQueryUsesIndex asserts EXPLAIN QUERY PLAN uses
a client_recept index and no longer SCANs the table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, Kpa-clawbot#7)

Both coverage layers bound moveend/zoomend straight to an immediate fetch, so a
single drag fired a storm of /api/rx-coverage requests (Kpa-clawbot#6); and every coverage
fetch swallowed errors with an empty .catch (Kpa-clawbot#7), so failures were invisible.

- Wrap the pan/zoom redraw in the shared 200ms debounce (keeping a stable
  reference so node-reach-coverage can still off() the handler). Direct redraws
  (day switch, fit-to-observer) stay immediate.
- Replace the empty catches with console.warn; the leaderboard failure also
  shows a one-line in-page message.

Test: test-node-reach-coverage-debounce.js loads addLayer with controllable
timers + the real debounce, fires a 6-event burst and asserts exactly one
coalesced fetch after the settle (7 fetches without the debounce).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Kpa-clawbot#19)

nqCovLegend's visibility was driven by an inline style="display:flex|none" that
applyCoverage rewrote, so CSS (print rules, themes) couldn't override it. Use an
.is-hidden class instead (matching the .nav-more-wrap.is-hidden pattern) toggled
via classList, and add .nq-cov-legend.is-hidden { display:none !important } to
node-reach-coverage.css.

Test: test-coverage-gate.js now asserts the rendered legend carries the
is-hidden class and uses no inline display style (fails if the inline style
returns).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, carmack)

Adds BenchmarkCoverageQuery: seeds ~1M receptions across a metro-area bbox and
times the per-node coverage query three ways. On a Ryzen 7 PRO 8845HS,
modernc.org/sqlite, 1M rows (benchtime 5x):

  or_query_indexed       2.28 s/op   (original OR/substr query, latlon index)
  or_query_table_scan    0.30 s/op   (same query, indexes dropped)
  inlist_query_indexed   0.42 ms/op  (sargable heard_key IN-list + composite)

The OR/substr shape can't use the heard_key index, so the planner drives from
the bbox and pays a random row fetch per candidate — slower than a plain scan.
Rewriting the per-node match as a heard_key IN-list (next commit) lets the
(heard_key, …) composite seek the few hundred matching rows: ~5400x faster than
the indexed OR query and bounded by node, not table size. Benchmarks don't run
in CI; this is on-demand evidence for the perf claim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…N-list (#5)

The benchmark (previous commit) showed the OR/substr match — "heard_keylen=32
AND heard_key=? OR heard_keylen IN (2,3) AND substr(?,1,keylen*2)=heard_key" —
can't use the heard_key index, so the planner drove from the bbox and paid a
random row fetch per candidate: 2.28 s/op at 1M rows, slower even than a full
scan (0.30 s).

A node's heard_key is always exactly its full pubkey or its 2-/3-byte prefix, so
replace the OR/substr with heard_key IN (pubkey, pubkey[:6], pubkey[:4]) — an
equivalent but sargable predicate. The (heard_key, …) composite index then seeks
the few hundred matching rows: 0.42 ms/op (~5400x faster), bounded by node not
table size. Applied to queryCoverageRows, queryCoverageFiltered (node filter) and
mobileRxStats via the shared coverageHeardKeyCandidates helper.

Test: TestClientReceptionsCoverageQueryUsesIndex now EXPLAINs the IN-list shape
and asserts an index seek, not a table scan. Existing coverage/endpoint tests
(which assert the same result rows) still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
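As a sketch of the IN-list rewrite: the helper name coverageHeardKeyCandidates appears in the commit, but the body below is a guess at its behaviour based on the "pubkey, pubkey[:6], pubkey[:4]" description, and heardKeyClause is a hypothetical companion that renders the placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// coverageHeardKeyCandidates returns the full pubkey plus its 3-byte and
// 2-byte hex prefixes. An empty pubkey yields no candidates.
func coverageHeardKeyCandidates(pubkey string) []string {
	pubkey = strings.ToLower(pubkey)
	if pubkey == "" {
		return nil
	}
	out := []string{pubkey}
	for _, hexLen := range []int{6, 4} {
		if len(pubkey) > hexLen {
			out = append(out, pubkey[:hexLen])
		}
	}
	return out
}

// heardKeyClause renders "heard_key IN (?,?,?)" with matching args. ok is
// false when there are no candidates, so the caller can skip the query
// instead of emitting an invalid "IN ()".
func heardKeyClause(pubkey string) (clause string, args []any, ok bool) {
	keys := coverageHeardKeyCandidates(pubkey)
	if len(keys) == 0 {
		return "", nil, false
	}
	args = make([]any, len(keys))
	for i, k := range keys {
		args[i] = k
	}
	return "heard_key IN (?" + strings.Repeat(",?", len(keys)-1) + ")", args, true
}

func main() {
	clause, args, ok := heardKeyClause(strings.Repeat("ab", 32))
	fmt.Println(clause, args, ok)
}
```

Each candidate is an equality match on the leading column of the composite index, which is what lets the planner seek rather than scan.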
@Kpa-clawbot

Owner

Kent Beck Gate (round 2) — TDD + test quality

Verdict: PASS (1 minor / non-blocking).

Round-1 findings I sampled each have a dedicated test that fails on revert with an assertion (not a build error), and the commit history shows one fix-per-commit with the test bundled — clean red-bait → green pattern for a backfill of issues found in review. I did not require classic red-then-green commits per finding because the round-1 work is a review-rework batch (not greenfield), and the tests demonstrably gate the behaviour.

Per-finding gate (8 sampled)

| # | Commit | Test | Anti-tautology check |
| --- | --- | --- | --- |
| #1 observer blacklist on client topic | 44fc6ac8 | TestClientRxCoverageBlacklistedDropped | Asserts 0 rows; revert restores pre-blacklist insert path → 1 row. ✅ |
| #2 / #10 hex-validate companion pubkey | 6643280c | TestHandleClientPacketRejectsNonHexPubkey | Drives !@#$, companionpk, "" (empty), g0g0, xyz; asserts 0 receptions AND 0 observers (catches both tables). ✅ |
| #14 HTML-escape pubkey in leaderboard | 970b3b70 | test-rx-coverage-escape.js | Slices the real row-builder out of rx-coverage.js via vm, drives "><img …, asserts no raw <img survives. No hand-copied duplicate; it tracks the source. ✅ |
| #15 /resolve enumeration + identity hiding | 4183b14a | TestResolvePrefix + TestResolvePrefixHidesBlacklistedAndHidden | Asserts 400 on a/aa/abc AND empty resolution for blacklisted+hidden. Each behavior independently gated. ✅ |
| #4 nil-safe coverage gate | 197e5f2e | TestRequireClientRxCoverageNilSafe | Exercises Server{} with nil cfg → 404, not panic. Reverting the helper's nil-receiver guard (the load-bearing half) does panic. The s.cfg == nil belt-and-suspenders in rx_dashboard.go is not independently gated, but that's defense-in-depth, noted in code. ✅ |
| #8 / #20 deterministic feature order + name precedence | c7806460 | TestAggregateCoverageDeterministicFeatureOrder + TestAggregateCoverageNamePrecedenceOrderIndependent | Name-precedence test runs rows in both orders with a resolver that returns different names per key: strong, will catch order-dependence reliably. Feature-order test is the minor concern below. ⚠️ (see M1) |
| #11 / #12 bound coverage response | 22655fc1 | TestAggregateCoverageCapsNodesPerCell, TestAggregateCoverageCapsFeatures | 30 nodes → 25+flag, 5625 cells → 5000+flag, AND a "small query is not truncated" inverse check; Six-Questions Q3 covered. ✅ |
| #17 clamp polar latitude | eac0e511 | TestHexCellAtClampsPolarLatitude | Asserts cell equality with clamped lat AND walks the boundary ring for NaN/Inf. Revert → tan(π/4 + …) → ∞ produces non-finite points → fails. ✅ |

Spot-checked the JS layer fixes too:

CI on tip e1b7d089: Go Build & Test ✅, Playwright E2E ✅.


Minor (non-blocking) — M1

TestAggregateCoverageDeterministicFeatureOrder (rx_coverage_test.go) feeds 4 cells and asserts the output is cell-sorted. With Go's randomized map iteration, 4 input cells produce 24 possible orderings — 1 is sorted by chance, so this test has a ~4% false-pass rate against a regression that removed the sort.Slice. The companion TestAggregateCoverageCapsFeatures does re-assert sorted order with thousands of cells, so the regression is gated overall, but the named "determinism" test would be sturdier with ≥10 cells or a deterministic-shuffle loop. Recommend tightening, not blocking.
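The ~4% figure is 1/4!, and the same arithmetic shows why ≥10 cells is the suggested floor. A small sketch, assuming every ordering is equally likely (Go leaves map iteration order unspecified and does not promise a uniform distribution, so treat these as rough estimates); chanceSorted is a hypothetical helper:

```go
package main

import "fmt"

// chanceSorted returns 1/n!, the probability that a uniformly random
// ordering of n distinct cells happens to be sorted already.
func chanceSorted(n int) float64 {
	p := 1.0
	for k := 2; k <= n; k++ {
		p /= float64(k)
	}
	return p
}

func main() {
	for _, n := range []int{4, 6, 10} {
		fmt.Printf("n=%d  false-pass chance=%.3g\n", n, chanceSorted(n))
	}
}
```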


Six Questions roll-up:

  • (a) Behavior vs implementation: test names describe behavior (RejectsNonHex…, ClampsPolarLatitude, NilSafe, OrderIndependent, Caps…). ✅
  • (b) Smallest test to catch the bug: each test isolates the exact branch with minimal fixtures; coverage_gate uses a shared testCompanionPK constant rather than duplicating hex. ✅
  • (c) Could a wrong impl pass? Mostly no — most tests assert both positive and negative branches. M1 is the one weak spot.
  • (d) Edge cases not tested: lat = exactly ±hexMaxLat boundary not explicitly asserted (only past it); fine, since the boundary clamp is a no-op there. Per-IP rate limit (#15 follow-up) explicitly punted to follow-up #1 in the commit body. Out-of-scope ✅.
  • (e) Names: all behaviour-named, good. ✅
  • (f) Setup vs assertion: test setup ≤ assertion in every file. ✅

Gate: PASS. Ship it.

@Kpa-clawbot

Owner

Munger Review (round 2) — schema & data model

Re-reviewed after the sargable rewrite (15b20f8/f46f67df/67358053), cap policy (22655fc), identity lock (c780646), and the new retention reaper (6eba937).

Verdict: 1 must-fix, 2 strong recommendations, the rest is good work.

The OR/substr → heard_key IN (...) rewrite is the right call. EXPLAIN QUERY PLAN on the rewritten query confirms SEARCH client_receptions USING INDEX idx_client_recept_heard_geo (heard_key=?) — index seek, not bbox scan. The bench harness (coverage_query_bench_test.go) compares both shapes at ~1M rows; numbers will speak for themselves. heard_keylen as the second column in the composite is redundant (the IN-list already enumerates the only legal lengths), but it's harmless and the lat/lon trailing keeps the index covering for the bbox range. Fine.

Cap policy (#11/#12) is correct in shape: per-cell sorted by SNR desc → top 25 with nodes_truncated; feature collection sorted by density → top 5000 with truncated. Densest-kept is the right loss policy for a heatmap. Identity lock (#20) by longest heard_key with lexical tie-break is order-independent — I checked, no flap. GeoJSON sort by cell is deterministic. Good.


MUST-FIX

1. client_receptions.rx_at has no index — the reaper, the leaderboard, and the time-window filter all scan the table. (cmd/ingestor/db.go:299-301, cmd/ingestor/maintenance.go:55-75, cmd/server/rx_dashboard.go:108-118)

Three places do rx_at <op> ?:

  • Reaper: DELETE FROM client_receptions WHERE rx_at < ? (EXPLAIN QUERY PLAN returns SCAN client_receptions). Daily, plus once at startup before ingestBuffer.Ready(). With 30 days of accumulation at any non-toy ingest rate, this is a multi-second writer-lock hold that stalls MQTT every reaper tick and delays startup drain.
  • Leaderboard: SELECT … FROM client_receptions cr … WHERE cr.rx_at >= ? GROUP BY cr.rx_pubkey ORDER BY COUNT(*) DESC → planner falls back to SCAN cr USING INDEX idx_client_recept_rxpk + temp B-tree for COUNT(DISTINCT) and ORDER BY. Every dashboard hit pays full-scan cost.
  • Coverage filter: queryCoverageFiltered adds rx_at >= ? when days > 0 (i.e. every UI request). The heard_key IN-list keeps it sargable for per-node queries, but the global /api/rx-coverage?bbox=… (no node=) drops the heard_key predicate and falls back to the bbox index with a post-filter on rx_at — full bbox scan for every dashboard panning event over the 30-day window.

UNIQUE(rx_pubkey, heard_key, rx_at) does NOT help — rx_at is not the leading column.

Fix: CREATE INDEX IF NOT EXISTS idx_client_recept_rx_at ON client_receptions(rx_at); in the same applySchema block. Cheap to add now; expensive to add later when the table is 50M rows and CREATE INDEX has to grab the writer for minutes.

This is the lollapalooza I'm warning about: a feature that grows unbounded between reaper sweeps (PR shipped with no growth-rate ceiling other than retention), a reaper that scans the whole table, and an opt-in dashboard whose own queries also scan. Each of the three pieces looks fine in isolation; together they compound into "coverage works for a month, then ingest mysteriously stalls every midnight."


Strong recommendations (not blocking)

2. Reaper holds the writer lock for the entire delete in one transaction. (cmd/ingestor/maintenance.go:64-78)

PruneOldPackets has the same shape and the same risk, but at least it has idx_transmissions_first_seen to make the WHERE selective. Once #1 is fixed and the planner can seek on rx_at, this is much less acute — but for a daily delete of potentially hundreds of thousands of rows it's still worth chunking, the way BackfillFromPubkey does (LIMIT chunkSize loop, commit between batches, optional time.Sleep to yield). The reaper runs before ingestBuffer.Ready() on startup, so an unchunked multi-second delete directly inflates the buffered backlog.
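A chunked reaper along the lines suggested here might look like the sketch below. It is not BackfillFromPubkey or the PR's reaper: pruneInChunks, execer and the fake database are hypothetical, the rowid subquery assumes client_receptions is an ordinary rowid table, and each Exec is assumed to commit on its own (as it does on a plain *sql.DB).

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// execer is the subset of *sql.DB that the sketch needs.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// pruneChunkSQL deletes at most LIMIT expired rows per statement.
const pruneChunkSQL = `DELETE FROM client_receptions
 WHERE rowid IN (SELECT rowid FROM client_receptions WHERE rx_at < ? LIMIT ?)`

// pruneInChunks deletes rows older than cutoff in batches of chunkSize,
// pausing between batches so other writers can get the lock.
func pruneInChunks(db execer, cutoff any, chunkSize int, pause time.Duration) (int64, error) {
	if chunkSize <= 0 {
		return 0, fmt.Errorf("chunkSize must be positive, got %d", chunkSize)
	}
	var total int64
	for {
		res, err := db.Exec(pruneChunkSQL, cutoff, chunkSize)
		if err != nil {
			return total, err
		}
		n, err := res.RowsAffected()
		if err != nil {
			return total, err
		}
		total += n
		if n < int64(chunkSize) {
			return total, nil // last (possibly partial) batch
		}
		time.Sleep(pause)
	}
}

// fakeDB simulates a table holding `rows` expired rows, for demonstration.
type fakeDB struct{ rows, calls int64 }

type fakeResult int64

func (r fakeResult) LastInsertId() (int64, error) { return 0, nil }
func (r fakeResult) RowsAffected() (int64, error) { return int64(r), nil }

func (f *fakeDB) Exec(_ string, args ...any) (sql.Result, error) {
	f.calls++
	n := int64(args[1].(int))
	if n > f.rows {
		n = f.rows
	}
	f.rows -= n
	return fakeResult(n), nil
}

func main() {
	db := &fakeDB{rows: 2500}
	total, err := pruneInChunks(db, int64(1718000000), 1000, 0)
	fmt.Println(total, db.calls, err)
}
```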

3. client_observers.last_seen is also unindexed. (cmd/ingestor/db.go:306-309, cmd/ingestor/maintenance.go:71-74)

Same shape: DELETE FROM client_observers WHERE last_seen < ? inside the same WriterTx. Table is small today, but it's piggybacking on the same reaper and the same writer-lock hold. Either add CREATE INDEX … ON client_observers(last_seen) or accept it consciously and document the upper-bound size assumption.


Out-of-scope (UI / future)

The server flags truncated and nodes_truncated correctly, but public/rx-coverage.js and public/node-reach-coverage.js never read either field. So the user-visible promise ("no silent data loss without flag") holds at the API contract, but the dashboard still silently drops cells from the user's perspective. Worth a follow-up issue, not a blocker for this PR.


"All I want to know is where I'm going to die, so I'll never go there." The reaper is where this feature dies at scale. Add the index.

@Kpa-clawbot Kpa-clawbot left a comment

Owner

Independent review (round 2)

Cold review — verified the 20 round-1 must-fix items are real (read every fix in gh pr diff, cross-checked against the noted commits). All accounted for:

  • #1 blacklist on client topic: handleMessage checks cfg.IsObserverBlacklisted(parts[2]) before dispatch, plus test (TestClientRxCoverageBlacklistedDropped). ✓
  • #2/#10 hex pubkey validation: clientPubkeyRe enforced; payload origin_id fallback removed; test covers !@#$, companionpk, xyz. ✓
  • #3 mobileRxStats wired in: per-node response carries mobile_receptions/mobile_clients; asserted. ✓
  • #4 nil-safe gate: Server/Config/ClientRxCoverageEnabled all nil-safe; TestRequireClientRxCoverageNilSafe covers it. ✓
  • #5/#18 sargable IN-list + composite index: coverageHeardKeyCandidates + composite (heard_key, heard_keylen, lat, lon); EXPLAIN-QUERY-PLAN test + 1M-row benchmark with numbers. ✓
  • #6/#7 debounced redraws + surfaced errors: debounce(refresh,200) with stable handler, console.warn replaces empty .catch; in-DOM test fires 6-event burst → 1 fetch. ✓
  • #8/#20 deterministic features + name precedence: sort.Slice by cell, longest-heard-key name wins; order-independent test. ✓
  • #9 true 0-hop advert: rebuilt 1100 + payload, asserts src='advert'/keylen=32 + GPS captured. ✓
  • #11/#12 response bounds: coverageCellNodeCap=25, coverageFeatureCap=5000, omitempty flags; tests cover both caps. ✓
  • #13 MeshConfigReady race: both rx-coverage.init() and node-reach.load() await; loadGen race guarded; controllable-promise test. ✓
  • #14 escape pubkey: escapeHtml(o.pubkey) for data-rx + label fallback; sliced-source test asserts no raw <img survives. ✓
  • #15 resolve hardening: 4-hex floor, blacklist/hidden-name parity (also applied to resolveHeardKey); both tests. ✓
  • #16 docs: single-flag/ACL-required language, retention/indexes/clamp/bounds documented. ✓
  • #17 polar clamp: ±hexMaxLat=85.05112878 in hexCellAt; ±90° test checks finite ring. ✓
  • #19 class-based legend hiding: .is-hidden class + CSS !important; gate test grep-asserts no inline style="display:. ✓
  • #1727 retention reaper: PruneOldClientReceptions via WriterTx (#1283 honored), startup + 24h ticker, independent of feature flag, clientRxDays in config.example.json. ✓

Scope matches the PR description (opt-in mobile client-RX coverage + /api/nodes/resolve); no scope creep beyond the gofmt/struct-alignment whitespace touch-ups in cmd/server/{config,types}.go.

New must-fix (nits surfaced in round 2)

  1. public/rx-coverage.js:9,118,134selectedName is assigned in three places but never read or rendered. Dead var; either drop it or surface it (e.g., a "Filtered to: " badge above the map).
  2. cmd/server/rx_coverage.go:160-176aggregateCoverage sorts features twice when truncation applies (densest-desc to slice, then cell-asc unconditionally). One pass after the slice is enough — second sort.Slice is cheap but the densest-first sort can be replaced with a partial selection.
  3. cmd/server/rx_dashboard.go:33-45heardKeyResolver's per-request cache is unbounded. A wide query touching N distinct heard_keys grows the map O(N) without an eviction cap. Bound to, say, 4096 entries or skip caching when over a threshold; the existing tests pass either way.
  4. cmd/server/config.go:320-325 — server RetentionConfig is missing ClientRxDays. The ingestor reads it; the server's JSON struct silently drops the operator's retention.clientRxDays key on parse. Even though the server doesn't prune, the field-set mismatch is a future footgun — mirror the field (or add a json:"-" placeholder with a comment) for symmetry.
  5. cmd/server/rx_dashboard.go:107-126,165-183: rxLeaderboard and queryCoverageFiltered don't filter blacklisted rx_pubkey. Ingest is blocked now (#1), but pre-blacklist legacy rows still surface a banned companion's name in the leaderboard and let an attacker target ?rx=<blacklisted>. Defense-in-depth: add a NOT IN (blacklist) filter (or a WHERE NOT EXISTS …) mirroring how /api/nodes/resolve hides identities.
  6. cmd/server/rx_coverage.go:queryCoverageRows / mobileRxStats — no guard against empty pubkey. Today the mux route guarantees non-empty, but coverageHeardKeyCandidates("") returns [] which would yield WHERE heard_key IN () (SQL syntax error). One-line guard (if pubkey == "" { return nil, nil }) is cheap insurance against a future caller.
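
The empty-pubkey guard in nit 6 can be sketched as a small helper that refuses to build an `IN ()` clause at all. This is a minimal illustration, not the PR's code: `inClause` and its signature are hypothetical, and only the column name `heard_key` comes from the review.

```go
package main

import (
	"fmt"
	"strings"
)

// inClause (hypothetical helper) builds a parameterised "col IN (?,?,...)"
// fragment plus its bound arguments. It reports ok=false for an empty key list
// so the caller can return early instead of sending "IN ()" to SQLite, which
// is a syntax error.
func inClause(col string, keys []string) (frag string, args []interface{}, ok bool) {
	if len(keys) == 0 {
		return "", nil, false
	}
	args = make([]interface{}, len(keys))
	for i, k := range keys {
		args[i] = k
	}
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(keys)), ",")
	return fmt.Sprintf("%s IN (%s)", col, placeholders), args, true
}

func main() {
	if _, _, ok := inClause("heard_key", nil); !ok {
		fmt.Println("no candidates: skip the query")
	}
	frag, args, _ := inClause("heard_key", []string{"ab", "abcd"})
	fmt.Println(frag, len(args))
}
```

Putting the guard in the clause builder rather than in each caller means a future caller cannot forget it.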

Out-of-scope

  1. Per-IP rate limiting on /api/nodes/resolve (acknowledged in author commit message as follow-up #1).
  2. Optional companion-signed broker token (called out in docs as future hardening).

Verdict: request changes on the six nits above. The 20 round-1 fixes are real and well-tested.

@Kpa-clawbot

Owner

DJB Review (round 2) — input parsing & opt-in gate

Verdict: request changes — round 1 must-fixes are present and correct, but the same threat class survives on two read-side endpoints. Two new must-fixes, one out-of-scope hardening note.

Round 1 fixes — verified

Must-fix #1: handleNodeRxCoverage leaks coverage geometry + counts for blacklisted / hidden nodes

cmd/server/rx_coverage.go lines 329–356 (handleNodeRxCoverage).

handleNodeReach (cmd/server/node_reach.go line 399) returns 404 for blacklisted and isPubkeyHidden pubkeys. The new per-node coverage endpoint at the SAME {pubkey} does not. heardKeyResolver hides the name (#15 parity) — but the GPS hex bins and mobile_receptions/mobile_clients counts are still returned for any blacklisted or hidden-prefix node, defeating the very hiding the rest of the API enforces. An attacker who knows a hidden node's pubkey can map where it has been received. Add the same IsBlacklisted + isPubkeyHidden 404 gate at the top of handleNodeRxCoverage (after pubkey is parsed). Also add isHexPubkey(pubkey) parity 400, matching handleNodeReach.

Must-fix #2: /api/rx-leaderboard returns blacklisted observers (incomplete #1)

cmd/server/rx_dashboard.go lines 174–200 (rxLeaderboard).

The SQL is SELECT cr.rx_pubkey, COALESCE(... n.name, co.name), ... FROM client_receptions cr LEFT JOIN nodes n ON n.public_key=cr.rx_pubkey LEFT JOIN client_observers co ON co.pubkey=cr.rx_pubkey GROUP BY cr.rx_pubkey. There is no filter for ObserverBlacklist, NodeBlacklist, or IsNameHidden. Threats:

  1. client_receptions rows that landed before this PR ships (no ingest-side blacklist gate existed) → blacklisted operator stays on the leaderboard forever, with name.
  2. Race window: operator added to blacklist while rows accumulate, OR config reload lag → leaks the same identity the rest of the API hides.
  3. A hidden-prefix node name (e.g. 🚫…) the nodes table holds is returned verbatim, contradicting IsNameHidden everywhere else.

Fix: filter the result set by IsObserverBlacklisted(rx_pubkey) and blank/strip the name when IsBlacklisted(rx_pubkey) || IsNameHidden(name). Cheapest correct form: post-query filter in Go using the same helpers the resolver uses (the result set is LIMIT 100 max — no perf concern). Add a test mirroring TestResolvePrefixHidesBlacklistedAndHidden.

New surface in retention reaper (6eba937) — clean

PruneOldClientReceptions deletes by rx_at < cutoff. rx_at is set via resolveRxTime, which hard-rejects future >14h and past >30d, so a malicious companion cannot publish rx_at=9999-… to evade pruning. WriterTx honors the single-writer rule. client_observers is pruned by server-set last_seen (not user-controlled). Default clientRxDays=30, 0 disables. No injection — bound parameters throughout. ✓

New surface in config-race fix (50da141) — clean

Pure JS init deferral behind MeshConfigReady, guarded by destroyed for navigation-away. No secrets touched, no DOM injection. ✓

Out-of-scope hardening (not blocking this PR)

  • /api/rx-coverage ?node= and ?rx= query params accept any string; lowercased and bound, no injection, but node= should reject non-hex up front rather than running a useless query.
  • No per-IP rate-limit on /api/nodes/resolve; the 4-hex floor raises worst-case enumeration cost from 256 to 65,536 queries but does not eliminate it. Referenced in PR body as "follow-up #1"; acceptable.

— djb 🛡️

…d (review r2)

Round-1 hid the node *name* via the resolver, but two read endpoints still leaked
the identities the rest of the API suppresses:

- handleNodeRxCoverage returned GPS hex bins + mobile_receptions/mobile_clients
  for any {pubkey}, so a blacklisted or hidden-prefix node's coverage was
  mappable by anyone who knew the key. Now mirrors handleNodeReach: reject
  non-hex pubkeys (400) and 404 blacklisted / isPubkeyHidden nodes before any
  query.
- rxLeaderboard had no blacklist/hidden filter, so a pre-PR or post-blacklist
  client_receptions row kept a banned operator on the board (with name). Now
  drops IsObserverBlacklisted contributors and blanks the name when
  IsBlacklisted(pubkey) || IsNameHidden(name). Result set is <=100, filtered in
  Go with the same helpers the resolver uses.

Tests: TestNodeRxCoverageHidesBlacklistedAndHidden (404 for both) and
TestRxLeaderboardHidesBlacklistedAndHidden (drop + blank); both fail without the
gates. setupTestDBv2's nodes table gains foreign_advert so GetNodeByPubkey (used
by isPubkeyHidden) works against the test schema. The per-node endpoint now
requires a full 64-hex pubkey, matching handleNodeReach.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Kpa-clawbot

Owner

Automated polish review — 4-persona parallel fan-out

Reviewers: adversarial · carmack (backend) · kent-beck (tests) · tufte (frontend). CI all-green ✅ (Go build, Playwright, Docker). mergeable=MERGEABLE, mergeStateStatus=CLEAN. Tests present (10 new _test.go files + 6 e2e JS).

Verdict: not merge-ready as-is. 1 BLOCKER + 10 MAJOR across all four reviewers, predominantly clustered around privacy + N+1 SQL + frontend a11y/dark-theme. Holding for operator triage rather than auto-merging.


BLOCKER (1)

  • Per-observer movement-trail disclosure. cmd/server/rx_dashboard.go:131-156 + public/rx-coverage.js:74-78. /api/rx-coverage?rx=<pubkey>&bbox=-90,-180,90,180&days=30&z=18 returns hex polygons (~10–30 m on the ground at z=18) for one companion only, with no auth and no k-anonymity threshold. The leaderboard publicises companion pubkeys and the UI wires a click that filters to a single rx — converting an opt-in coverage contribution into a public 30-day GPS trail (home / work / commute) of that user. Onboarding copy in docs/client-rx-coverage.md does not warn that single-observer views are world-readable. Fix: gate ?rx= filtering behind k-anonymity (e.g. require ≥N distinct cells AND ≥N distinct heard nodes), coarsen reported hex centroids when only one observer contributes, or drop the per-observer view entirely. At minimum snap stored lat/lon server-side before insert and add a loud privacy warning to onboarding + docs.

MAJOR (10)

Privacy / abuse (adversarial)

  • cmd/ingestor/client_reception.go:36-50 (buildClientReception) — raw GPS lat/lon stored at full float64 precision; pos_acc_m recorded but never enforced. Quantise to ~4dp (≈11 m) and drop rows with pos_acc_m above a threshold.
  • cmd/ingestor/main.go:568-578 + cmd/ingestor/client_reception.go — no per-rx_pubkey write rate limit on meshcore/client/<pubkey>/packets. One compromised companion can flood the writer mutex at line rate, starving real ingest. Add a token bucket or per-minute coalesce.
  • cmd/server/node_resolve.go:23-49 — unauthenticated, unrate-limited public endpoint. A 65k-request walk over the 4-hex prefix space dumps every visible name+pubkey pairing. Add response-cache + IP rate limit; consider requiring a longer prefix (6 bytes matches the data model floor).

Backend efficiency (carmack)

  • cmd/server/rx_dashboard.go:33-49: heardKeyResolver is N+1 SQL, issuing one LIKE ? LIMIT 2 per distinct heard_key per request, on the writer-shared connection. Bulk-resolve via one IN (...) + one LIKE-OR.
  • cmd/ingestor/db.go:283-298 / maintenance.go:65-76 — no index on client_receptions.rx_at, yet retention DELETE … WHERE rx_at < ? and the leaderboard WHERE cr.rx_at >= ? both filter on it. Daily prune will devolve to a full scan under the writer lock. Add CREATE INDEX idx_client_recept_rxat ON client_receptions(rx_at).
  • cmd/server/rx_coverage.go:251-265 (mobileRxStats) wired in at handleNodeRxCoverage:354 — second query unbounded by bbox AND unbounded by time on every per-node coverage request. Cannot be served from the bbox-restricted scan. Add a since window matching the user-visible "X days", or precompute per-node summary at write time.

Frontend (tufte)

  • public/node-reach-coverage.css:8-13 — coverage palette only declared in :root (#2ecc71, #e67e22, #e74c3c, #95a5a6). No [data-theme="dark"] override. Mid-luminance saturated colours glare on dark basemaps; the rest of the codebase (TIERS, role colours) keeps theme-aware variants.
  • node-reach-coverage.js:8-15, rx-coverage.js:18-24, both legends — SNR tiers conveyed by hue alone. Reach already establishes the colour + colour-blind glyph pattern; this layer drops it. ~8% of male users can't reliably distinguish #e67e22 vs #e74c3c. Add a hatch/opacity ramp or glyph in tooltip.
  • rx-coverage.js:88-90, 95-100: .rxb-row[data-rx] is a <div> with click handler but no role="button", tabindex, or Enter/Space keydown, so keyboard + screen-reader users cannot activate observer filtering. Render as <button> or add role+tabindex+key handler.

Tests (kent-beck)

  • test-node-reach-coverage-e2e.js:13-17 silently skips when clientRxCoverage is off — the CI default (commit 191c9980). The end-to-end browser test for the whole feature never runs on CI master. Add a CI job that flips the toggle on for an e2e pass, or land a smaller smoke that runs with the flag on.

MINOR (selected — full lists available on request)

  • client_observers.name stored verbatim, no length cap or sanitisation; leaks to public leaderboard if a companion ships PII (adversarial).
  • idx_client_recept_rxpk redundant with the UNIQUE (rx_pubkey, heard_key, rx_at) constraint's implicit index — drop it (carmack).
  • na.count == 1 || row.RxAt >= na.latestAt — the count == 1 branch is dead ("" < any RFC3339) (carmack).
  • Inline style="..." strings in rx-coverage.js (12 sites) belong in CSS (tufte).
  • Day-range buttons missing aria-pressed; .rxb-row.sel outline reuses --accent which doubles as strong-SNR green (tufte).
  • coverage_query_bench_test.go is perf-only — no behaviour assertion (kent-beck).

Test discipline (kent-beck)

Strong on assertions — every Go _test.go file checks exact counts / status codes / behaviour deltas; no tautologies. Notable: rx_coverage_test.go validates pixel-rendering invariance at res 4..16, cap behaviour with exact counts, and name-precedence independence. Red-then-green history not required for community PRs; N/A.

Clean surfaces noted

  • deriveHeardKey (client_reception.go:114-160) — FLOOD vs DIRECT path attribution matches firmware (Mesh.cpp removeSelfFromPath); 1-byte/2-hex floor consistent with Reach.
  • hexgrid.go clamp/precision behaviour — covered by tests; no NaN/Inf paths.
  • Coverage feature cap + cell node cap — exercised by unit tests with exact-count assertions.

This is an automated review; a human operator will follow up on the BLOCKER (privacy) and MAJOR triage. Bot will not auto-merge or push fixes to a community branch.

@efiten

efiten commented Jun 16, 2026

Collaborator Author

Review round 2 — both must-fixes addressed (588b4afd)

Good catch — you're right that round 1 hid the node name but not the rest of the identity on two read paths. Both are fixed, each with a test that fails without the gate; the opt-in gate is untouched and the bleed check is clean.

Must-fix #1: handleNodeRxCoverage leaked geometry/counts for hidden nodes

The per-node coverage endpoint now mirrors handleNodeReach at the same {pubkey}:

  • isHexPubkey(pubkey) → 400 on a non-full/non-hex key (parity).
  • s.cfg.IsBlacklisted(pubkey) || s.isPubkeyHidden(pubkey) → 404 before any query, so the GPS hex bins and mobile_receptions/mobile_clients are no longer retrievable for a blacklisted or hidden-prefix node.

Test: TestNodeRxCoverageHidesBlacklistedAndHidden asserts 404 for both a blacklisted and a 🚫-hidden node.

Must-fix #2: /api/rx-leaderboard returned blacklisted observers

rxLeaderboard now post-filters the result set (≤100 rows, no perf concern) with the same helpers the rest of the API uses:

  • IsObserverBlacklisted(rx_pubkey) → row dropped entirely (handles pre-PR rows, blacklist-after-ingest, and config-reload lag).
  • IsBlacklisted(rx_pubkey) || IsNameHidden(name) → name blanked.

Test: TestRxLeaderboardHidesBlacklistedAndHidden covers a normal contributor (kept), an observer-blacklisted one (dropped), and node-blacklisted + hidden-prefix ones (name blanked).

(One test-infra note: setupTestDBv2's nodes table gained foreign_advert so GetNodeByPubkey — which isPubkeyHidden calls — works against the test schema.)

Out-of-scope notes

  • /api/rx-coverage ?node= non-hex up-front rejection and the /api/nodes/resolve per-IP rate-limit: left as the acknowledged follow-ups (rate-limit = follow-up #1 in the PR body), as you flagged them non-blocking.
  • The retention reaper and config-race surfaces you re-verified clean — thanks.

CI green on 588b4afd. Still not merging — that's your call.

Re-spawn review:

Erwin Fiten and others added 4 commits June 16, 2026 15:50
… index (polish review)

The retention reaper's DELETE … WHERE rx_at < ? and the leaderboard's
WHERE rx_at >= ? both filtered on an unindexed column → full scan under the
writer lock (the reaper from 6eba937 introduced the DELETE). Add
idx_client_recept_rxat. Drop idx_client_recept_rxpk: it duplicates the
rx_pubkey-leading index the UNIQUE(rx_pubkey, heard_key, rx_at) constraint
already creates, which serves the ?rx= filter and leaderboard GROUP BY.

Also drop the dead `na.count == 1` guard in aggregateCoverage (latestAt starts
"", so the first row always satisfies rx_at >= latestAt) — no behavior change.

Test: TestClientReceptionsRetentionUsesRxAtIndex asserts the DELETE plan seeks
idx_client_recept_rxat instead of scanning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…CKER)

The polish review flagged that the per-observer view (/api/rx-coverage?rx=) is an
unauthenticated, fine-grained movement trail of a single contributor, with no
warning in onboarding. Per operator decision this stays an accepted tradeoff
(opt-in, default OFF, fine resolution is what makes the aggregate map useful),
but consent must be informed.

Add a "Privacy — contributor location is public" section: it states plainly that
contributions are world-readable, that the per-observer view reconstructs
movements, and that a pseudonymous companion name does NOT mitigate it (locations
are identifying; the pubkey links all points). Operators are told to warn users;
contributors are told not to use a carried device if that matters. The enabling
steps cross-link to it. Notes the further-hardening levers (lower clientRxDays,
auth proxy, future coarsening / k-anonymity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…view)

heardKeyResolver issued one `LIKE ? LIMIT 2` per distinct heard_key per request
on the writer-shared read connection. Replace it with heardKeyResolverFor(rows):
collect the distinct heard_keys once and resolve them all in a single round-trip
per 200-key chunk — a UNION ALL of per-prefix `LIMIT 2` subqueries (subquery form
because a bare LIMIT on a UNION ALL term is a SQLite syntax error; prefixes are
hexPrefixRe-validated so literal interpolation is injection-safe). Per-key work
stays bounded at 2 rows, and the Kpa-clawbot#15/Kpa-clawbot#1181 blacklist/hidden hiding is preserved.
resolveHeardKey is now a thin wrapper over the batch (single code path).

Test: TestBatchResolveHeardKeys checks unique/ambiguous/unknown/hidden in one
call; existing resolver + endpoint tests still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tufte)

- Dark theme: the saturated --nq-cov-* palette only existed in :root and glared
  on dark basemaps. Add [data-theme="dark"] variants (matches the dashboard's
  theme-aware tokens).
- Colour-blind: SNR tiers were hue-only. Add a redundant, monotonic fill-opacity
  ramp (strong>mid>weak>grey) in both coverage layers so orange vs red are
  distinguishable without relying on hue; the per-cell SNR stays in the tooltip.
- Keyboard/SR: the leaderboard .rxb-row was a click-only <div>. Add
  role="button", tabindex=0, aria-pressed, aria-label, an Enter/Space keydown
  handler, and a :focus-visible outline. aria-pressed added to the day buttons.

Tests: test-node-reach-coverage.js asserts the opacity ramp is strictly
decreasing; test-rx-coverage-escape.js asserts the row carries role/tabindex/
aria-pressed. coverageFillOpacity is exported for the unit test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@efiten

efiten commented Jun 16, 2026

Collaborator Author

Polish review (4-persona) — triaged; most items fixed

Thanks — genuinely useful sweep. Most of it is now addressed; the rest is deferred with rationale. Opt-in/default-OFF gate untouched, bleed-clean, CI green.

🚫 BLOCKER — per-observer movement trail

Operator decision: kept as a documented tradeoff rather than changing the map — fine resolution is what makes the aggregate map useful, and the feature is opt-in / OFF by default. You're right there's no auth; the gap the review most clearly identifies is informed consent, which was missing. Added a "Privacy — contributor location is public" section to docs/client-rx-coverage.md (e147d8df): it states the per-observer view reconstructs movements, that a pseudonymous companion name does not mitigate it (the locations identify; the pubkey links all points), tells operators to warn users and contributors to avoid carried devices, and documents the hardening levers (lower clientRxDays, auth proxy, coordinate coarsening / k-anonymity). Happy to implement coarsening or a k-anonymity gate on ?rx= if you'd rather have code than a consent doc — your call.

Fixed — backend (carmack)

  • rx_at unindexed — added idx_client_recept_rxat; the retention DELETE … WHERE rx_at < ? and the leaderboard window now seek it (was a full scan under the writer lock — gap from the reaper). (2b41dd25)
  • Redundant idx_client_recept_rxpk dropped (the UNIQUE(rx_pubkey, …) index covers it); dead count==1 branch removed. (2b41dd25)
  • heardKeyResolver N+1: replaced with a single batched resolve (UNION ALL of per-prefix LIMIT 2 subqueries, one round-trip per 200-key chunk, bounded 2 rows/key, #15 hiding preserved). (c11465a0)

Fixed — frontend a11y/theming (tufte) (d6ea742f)

  • Dark theme: added [data-theme="dark"] --nq-cov-* variants.
  • Colour-blind: added a redundant, monotonic fill-opacity ramp (strong>mid>weak>grey) to both layers, so tiers don't rely on hue alone; per-cell SNR stays in the tooltip.
  • Keyboard/SR: leaderboard rows are now role="button" + tabindex + aria-pressed + aria-label with an Enter/Space handler and a :focus-visible outline; day buttons get aria-pressed.

Pushback

  • mobileRxStats "unbounded scan" — it was rewritten earlier (67358053) to the sargable heard_key IN (…) form, so it's an index seek bounded by the node's own reception count, and retention bounds it in time. Left all-time semantics intentionally (the "heard by N clients" total is more useful unwindowed); can add a ?days= bound if you'd prefer it track the view window.

Deferred follow-ups (with reasons)

  • e2e never runs on CI master (flag off by default) — the fix is a CI workflow change to run the coverage e2e with the toggle on; that's a maintainer infra call (the e2e is already wired to self-skip when disabled), so I left .github/workflows alone. Happy to add the job if you want it.
  • Rate-limits (/api/nodes/resolve, per-rx_pubkey write): acknowledged follow-up #1.
  • MINORs: client_observers.name length cap/sanitise, the 12 inline-style sites → CSS, and a behaviour assertion on the perf bench. Low-risk; can fold in on request.

CI green on the latest head. Not merging — maintainer's call. Tell me if you want the BLOCKER done in code (coarsening/k-anonymity), the e2e CI job, or any deferred MINOR pulled in.

Re-spawn review:

@Kpa-clawbot

Owner

Bot watcher — operator decision needed (no further auto-review)

Polish loop hard cap (2 rounds) reached. Verified the 5 fixup commits since round-2 polish (588b4afd, 2b41dd25, e147d8df, c11465a0, d6ea742f) address every backend MAJOR (rx_at index + redundant-index drop, N+1 batched resolver, blacklist/hidden filter on leaderboard + per-node coverage, deterministic order, response caps), every frontend MAJOR (dark-theme palette, opacity ramp for colour-blind, role=button + aria-pressed + keyboard handler), and the dead-code MAJOR (mobileRxStats wired in).

CI ✅ on tip. mergeable=MERGEABLE, mergeStateStatus=CLEAN.

Three open items, all author-deferred to the maintainer:

  1. BLOCKER (privacy — per-observer movement trail at z=18). Author kept the fine-resolution ?rx= view as an accepted tradeoff and shipped a "Privacy — contributor location is public" section in docs/client-rx-coverage.md instead. Code mitigations (k-anonymity gate on ?rx=, server-side lat/lon snap, or dropping the per-observer view) are available on request.
  2. e2e never runs on CI master (flag default-off; test self-skips). Needs a .github/workflows job that flips the toggle on for a coverage e2e pass.
  3. Per-IP rate limit on /api/nodes/resolve (acknowledged follow-up #1 in the PR body).

Minor adversarial nits from the round-2 review remain unaddressed (selectedName dead var, per-request resolver cache unbounded, server-side RetentionConfig.ClientRxDays mirror, empty-pubkey guard in coverageHeardKeyCandidates) — all non-blocking.

Not auto-merging: community PR + unresolved privacy BLOCKER deferred to maintainer.

@Kpa-clawbot Kpa-clawbot left a comment

Owner


Kent Beck Gate (round 2) — TDD + test quality

Verdict: APPROVED — all 5 follow-up commits land with appropriate tests; no must-fix issues.

Per-commit TDD audit

588b4afd fix(coverage): hide blacklisted/hidden nodes on coverage + leaderboard — ✅

  • TestNodeRxCoverageHidesBlacklistedAndHidden (rx_coverage_endpoint_test.go) asserts 404 for blacklisted and hidden-prefix pubkeys at /api/nodes/{pk}/rx-coverage. Reverting the gate in handleNodeRxCoverage flips the response from 404→200 → assertion fails. ✓
  • TestRxLeaderboardHidesBlacklistedAndHidden (rx_dashboard_test.go) covers three distinct behaviors: observer-blacklisted dropped entirely, node-blacklisted name blanked but row kept, hidden-prefix name blanked but row kept. Reverting the post-query filter loop flips all three → assertions fail. ✓
  • Six questions: tests assert observable behavior (HTTP code, map of pubkey→row), not internal call order. Setup is wide (4 rows × 3 hide policies) but justified — it's the matrix the fix is designed to cover.

2b41dd25 perf(coverage): index rx_at; drop redundant rx_pubkey index — ✅

  • Perf-exempt per AGENTS.md, but the commit still lands a real perf gate: TestClientReceptionsRetentionUsesRxAtIndex uses EXPLAIN QUERY PLAN and asserts idx_client_recept_rxat appears. A future drop of the index, or a query rewrite that bypasses it, fails the test on the assertion (not on perf flake). This is the right shape for a perf regression test. ✓
  • Drive-by behavior change in aggregateCoverage (drop count==1 guard, rely on lexical >= "") is invariant-preserving but unasserted by a new test. Out-of-scope nit only.

e147d8df docs(coverage): warn contributor location is public — ✅ pure-docs exemption.

c11465a0 perf(coverage): batch heard_key resolution to kill N+1 — ✅

  • Perf-exempt, but TestBatchResolveHeardKeys covers the four resolution semantics (unique, ambiguous, unknown, hidden-prefix) of the new batch function with one call. Reverting batchResolveHeardKeys back to a per-key loop wouldn't fail this test (it asserts correctness, not query count), so the test gates semantic regressions in the new batched code path, not the N+1 itself. Parent flagged this as a "is the query count pinned?" question — answer: not pinned, accepted under perf exemption.
  • resolveHeardKey is now a thin wrapper over the batch path, so existing single-key resolve tests transitively exercise the batched code. Good consolidation.

d6ea742f fix(coverage): a11y + dark-theme for coverage layers — ✅

  • test-node-reach-coverage.js asserts monotonic opacity ramp (strong > mid > weak > grey). Reverting coverageFillOpacity to constant 0.45 fails the strict-inequality assertion. ✓
  • test-rx-coverage-escape.js adds three keyboard-a11y assertions on the rendered row (role="button", tabindex="0", aria-pressed="(true|false)"). Reverting the renderBoard HTML changes fails all three. ✓
  • Six questions: tests name the behavior (opacity must ramp strong>mid>weak>grey, row must be focusable), not the implementation. Setup is minimal.

Six-question pass over test files in this batch

  • cmd/server/rx_coverage_endpoint_test.go: behavior-named, minimal setup, no tautologies.
  • cmd/server/rx_dashboard_test.go: three new tests, each with a single distinct invariant. The BatchResolveHeardKeys cases map is the smallest test that catches each semantic class.
  • cmd/ingestor/client_reception_test.go: EXPLAIN QUERY PLAN is exactly the right tool for the perf invariant; no flake surface.
  • test-node-reach-coverage.js / test-rx-coverage-escape.js: strict-inequality + regex assertions are tight; no implementation-coupling.

Out-of-scope (do not block)

  1. cmd/server/rx_dashboard.go aggregateCoverage removed count==1 guard — behaviorally equivalent (lexical >="" always true on first row) but unasserted; consider a one-line test if revisited.
  2. Dropping idx_client_recept_rxpk is reasoned-about ("UNIQUE leading column"), but no test pins that the ?rx= filter still hits an index. EXPLAIN-QUERY-PLAN-style test would harden it.

@Kpa-clawbot Kpa-clawbot left a comment

Owner


Munger Review (round 2)

Cold re-review of the 5 follow-up commits (588b4af, 2b41dd2, e147d8d, c11465a, d6ea742) against the round-1 findings, applied with inversion + second-order incentives.

What the round-2 fixes got right

  • 588b4af correctly closes the two identity leaks: handleNodeRxCoverage now mirrors handleNodeReach (400 on non-hex, 404 on blacklist/hidden) so GPS hex bins and mobile_* counts aren't fetchable at a hidden pubkey, and rxLeaderboard drops IsObserverBlacklisted and blanks names for blacklisted/hidden identities. Test coverage matches both gates. The setupTestDBv2 schema patch (adding foreign_advert) is the right minimal change to make isPubkeyHidden real in tests.
  • 2b41dd2 is a clean swap: idx_client_recept_rxat backs the reaper DELETE WHERE rx_at < ?, and idx_client_recept_rxpk is genuinely redundant because the UNIQUE(rx_pubkey, heard_key, rx_at) constraint already creates an rx_pubkey-leading index. Dropping the dead na.count == 1 guard in aggregateCoverage is correct ("" ≤ any RFC3339).
  • e147d8d says the quiet part out loud: opt-in + default-OFF + fine resolution + per-observer view = informed-consent territory. The doc is explicit that a pseudonymous companion name does not mitigate the trail, names the further-hardening levers, and links it from the enabling steps.
  • c11465a kills the per-key round-trip cleanly. Resolver is built once per request from the rows that will be aggregated, so every key the resolver is asked about is in the batch (no fall-through to a slow path). Per-request scoping means no cross-request cache to poison. The 'pfx' AS pfx echo-back keys the per-prefix aggregator deterministically, and LIMIT 2 per prefix preserves the unique-vs-ambiguous distinction.
  • d6ea742 addresses both round-1 a11y items (keyboard-reachable role="button" rows with aria-pressed + focus-visible, monotonic opacity ramp as a non-hue SNR cue) and the dark-theme palette gap.

Must-fix

  1. cmd/server/rx_dashboard.go (batchResolveHeardKeys, ~lines 78–96): literal-interpolated LIKE is one regex change away from SQL injection. The safety argument ("k is hexPrefixRe-validated") lives in a comment, not in the function. Anyone who later widens hexPrefixRe in node_resolve.go (uppercase, :, etc., for whatever reason) silently turns this into an injection sink. Either (a) parameterize: build ? placeholders for both the pfx literal and the LIKE operand and pass args ...interface{}, or (b) re-assert inside the loop with a function-local stricter check before interpolating. "Show me the incentive": a future contributor optimizes the prefix regex and has no reason to grep this file.

  2. cmd/server/rx_dashboard.go (rxLeaderboard, ~lines 199–215): drop-after-LIMIT shrinks the leaderboard below what the caller asked for. The SQL LIMIT ? is applied before the Go-side IsObserverBlacklisted / IsBlacklisted / IsNameHidden filter. If N of the top-100 contributors are blacklisted observers, the public leaderboard returns 100−N rows — not 100. Not a leak, but a quality regression on a public surface, and the second-order incentive is bad (a blacklisted operator can occupy a top slot for free, just invisibly). Fix options: push NOT IN (observer_blacklist) into the SQL (cheapest), or over-fetch (LIMIT ?*2) and re-cap after filtering.

  3. cmd/ingestor/client_reception_test.go (TestClientReceptionsRetentionUsesRxAtIndex): only the reaper is asserted; the leaderboard plan is not. The commit message and index comment both promise the new index covers WHERE rx_at >= ? in rxLeaderboard, but there's no EXPLAIN QUERY PLAN test for the leaderboard SELECT. SQLite's planner is fond of choosing the UNIQUE-backed rx_pubkey-leading index instead, especially with the GROUP BY rx_pubkey. Add a second EXPLAIN test against the actual leaderboard query (or a representative form) so a future schema tweak doesn't quietly regress to a full scan under the writer lock.

  4. cmd/server/rx_dashboard.go (batchResolveHeardKeys, query error path, ~lines 102–106): silent fallback hides a real bug. On db.Query error, every key in the batch is mapped to (k, "") and the function continues. Operationally this presents as "all resolutions ambiguous" — a coverage map with no names — and there's no log line. Either log the error once (already have a logger on *Server), or surface it up so the handler can decide. Inversion: how would I notice this is broken? — answer: I wouldn't.

Out-of-scope (pre-existing or accepted)

  • The privacy tradeoff itself (per-observer view = movement trail) is an accepted, documented operator decision.
  • clientRxDays retention default lives outside this PR.
  • The companion-side signed-token follow-up is already tracked.

Verdict

NEEDS-WORK — 4 must-fix items, all small. The round-1 BLOCKERs are genuinely closed; the remaining items are hardening against the next change rather than today's behavior, which is the right time to fix them.

efiten and others added 2 commits June 17, 2026 07:50
…og query errors (review r2)

Round-2 must-fix items on the coverage dashboard:
- #1 batchResolveHeardKeys: parameterize the per-prefix LIKE (bound args, no
  literal interpolation) so it stays injection-safe if hexPrefixRe ever widens.
- #2 rxLeaderboard: the SQL LIMIT runs before the Go-side observer-blacklist
  drop, shrinking the public board below the requested limit. Over-fetch by the
  blacklist size and re-cap to limit. Test: blacklisted top contributors no
  longer shrink the result.
- #4 batchResolveHeardKeys: log.Printf the query error instead of swallowing it
  (a silent fallback presents as 'every name ambiguous' with no signal).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ent (review r2 #3)

EXPLAIN shows the leaderboard SELECT is served by the UNIQUE(rx_pubkey,heard_key,
rx_at) covering index, NOT idx_client_recept_rxat. Add an EXPLAIN test that pins
it as index-backed (guards against a regression to a bare table scan under the
writer lock; the table is retention-bounded so a covering scan is fine), and
correct the db.go comment that wrongly claimed rx_at backs the leaderboard window.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@efiten

efiten commented Jun 17, 2026

Collaborator Author

Round-2 must-fix items addressed (22065640)

  1. batchResolveHeardKeys LIKE injection surface — fully parameterized now: bound ? args for both the pfx echo and the LIKE operand, no interpolation of the prefix. Safe regardless of how hexPrefixRe later evolves.
  2. rxLeaderboard drop-after-LIMIT shrink — over-fetches by the observer-blacklist size (LIMIT limit + len(observerBlacklist)) and re-caps to limit after the Go-side filter, so blacklisted top contributors can't shrink the public board. New test TestRxLeaderboardLimitSurvivesBlacklistDrop (2 blacklisted at the top → still returns exactly limit good rows in order).
  3. Leaderboard query plan test — added TestRxLeaderboardQueryIsIndexBacked (EXPLAIN QUERY PLAN). It confirms your prediction: the leaderboard is served by the UNIQUE(rx_pubkey, heard_key, rx_at) covering index, not idx_client_recept_rxat. That's a covering scan (no table-heap access) on a retention-bounded table — acceptable — and the test guards against a regression to a bare table scan. I also corrected the db.go comment that wrongly claimed idx_client_recept_rxat backs the leaderboard window; that index backs the retention DELETE (separately tested).
  4. Silent query-error fallback: batchResolveHeardKeys now log.Printfs the query error before falling back to "all ambiguous", so a broken resolver is visible in the logs instead of presenting as a nameless map.
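For item 1, a sketch of what "bound ? args for both the pfx echo and the LIKE operand" can look like. The table name (nodes), column names, and buildPrefixQuery are assumptions made for illustration; the real query in batchResolveHeardKeys is not reproduced in this thread.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hexPrefix mirrors the idea of hexPrefixRe; the exact pattern in the PR is
// not shown, so this one is an assumption.
var hexPrefix = regexp.MustCompile(`^[0-9a-fA-F]+$`)

// buildPrefixQuery returns one SELECT per valid prefix, joined with UNION ALL.
// Every prefix travels as a bound argument: once as the echoed pfx column and
// once as the LIKE operand. Nothing from the input is spliced into the SQL text.
func buildPrefixQuery(prefixes []string) (string, []any) {
	parts := make([]string, 0, len(prefixes))
	args := make([]any, 0, 2*len(prefixes))
	for _, p := range prefixes {
		if !hexPrefix.MatchString(p) {
			continue // defence in depth: skip anything that is not plain hex
		}
		parts = append(parts, "SELECT ? AS pfx, name, pubkey FROM nodes WHERE pubkey LIKE ?")
		args = append(args, p, p+"%")
	}
	return strings.Join(parts, " UNION ALL "), args
}

func main() {
	q, args := buildPrefixQuery([]string{"ab12", "c0ffee", "x'; DROP TABLE nodes; --"})
	fmt.Println(q)
	fmt.Println(args...)
}
```

The resulting pair would be handed to db.Query(q, args...); because the SQL text is identical for every prefix, widening the validation regex later cannot change what the database parses.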

Out-of-scope items left as noted. CI green on the push.

@efiten

efiten commented Jun 17, 2026

Collaborator Author

Heads-up: the e2e red here is a pre-existing fixture time-bomb on master, not this PR

The Playwright job fails at:

test-issue-1630-reach-mobile-e2e.js:40 — no repeater with reach links found in fixture
    at pickRepeaterWithReach

pickRepeaterWithReach scans /api/nodes?role=repeater and needs one with >= 2 links from /api/nodes/{pk}/reach?days=30. It fails at the data step (before any page render), and the same failure is happening on master independent of this PR:

  • master run 27637454126 (commit 1476b857, 2026-06-16) failed with this exact error;
  • the same commit 1476b857 passed on 2026-06-15.

Same code, pass→fail across one day ⇒ the e2e fixture's reach observations have aged past the rolling 30‑day window, so no repeater qualifies. This PR is Go-only on the coverage dashboard (cmd/server/rx_dashboard.go, batch resolver / leaderboard) — it doesn't touch node_reach.go, /api/nodes, the reach page, or the fixture. Go Build & Test is green here.

Not asking to fix it in this PR (it's infra, and it blocks master too) — flagging so the e2e red isn't attributed to the coverage change. Likely fixes on your side: regenerate the e2e fixture with now-relative reach timestamps, or have pickRepeaterWithReach widen the window / seed a guaranteed reach repeater.

efiten and others added 3 commits June 17, 2026 11:28
Replace raw COUNT(*) leaderboard ranking with a per-observer score that sums
1/(observers covering each ~150m cell), plus cells/score fields. Spam-proof
(parked node = 1 cell) and kills dense-area bias. Aggregated in Go over the
window; ties broken by receptions then pubkey.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… rewrite

The leaderboard no longer GROUPs BY rx_pubkey in SQL (it range-scans rx_at and
aggregates in Go), so the old "served by the UNIQUE index as a COVERING scan"
note was inaccurate. The conclusion stands: a dedicated rx_pubkey index is
still redundant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Render the Top mobile observers board as a table with score/cells columns,
every column sortable (default score desc), and a hover tooltip on the score
header explaining the frontier weighting. Row click-to-filter preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@efiten efiten force-pushed the feat/client-rx-coverage-pr branch from 386841d to 547e562 Compare June 17, 2026 09:32

deploy-live.sh and deploy-staging.sh contain host-specific deploy logic
(on8ar.eu, mesh-internal, container names) — they belong only on the deploy
host, not in version control. Remove them from the repo and gitignore them so
future deploys (git reset --hard) no longer overwrite the host-local copies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Kpa-clawbot

Owner

Incremental review — commits 2206564..eabbae6

Scope: 1574e087 (doc refresh), 547e5624 (sortable leaderboard table), 997aaee0 (frontier-weighted ranking), eabbae61 (untrack env deploy scripts). Prior code unchanged since 22065640 is not re-reviewed.

Adversarial

  • [MAJOR] cmd/server/rx_dashboard.go:240-258 — query dropped its LIMIT and now streams every reception in the window into Go maps. On a busy week (tens to hundreds of thousands of rows) one unauthenticated /api/rx/leaderboard?days=30 call materialises three maps (obsAgg, per-observer cells/nodes sets, and cellObservers) for the whole table. There is no row cap, no context-cancel check inside the scan loop, and the handler is reachable by any client that satisfies requireClientRxCoverage. → Add a hard SQL LIMIT (e.g. 500k) with a diagnostic log on truncation, or stream-aggregate with an early bail; honour r.Context() between rows.
  • [MAJOR] cmd/server/rx_dashboard.go:309-313 — score dilution counts blacklisted observers. cellObservers[cell] is populated before the blacklist filter, so a hidden/blacklisted observer that also covers cell C lowers every legitimate observer's contribution from 1.0 to 1/N. Operators of blacklisted nodes can therefore silently suppress public scores. → Build cellObservers only from non-blacklisted pubkeys, or compute score in a second pass after filtering.
  • [NIT] public/rx-coverage.js:99-101 — score column tooltip is in Dutch while every other label/title on the page is English; inconsistent UX and harder to grep. → Translate to English (or mirror in both via lang=).
  • [NIT] public/rx-coverage.js:140: Number(o.score).toFixed(1) can render two rows with identical "0.5" while ranking them differently (tie-break on receptions then pubkey). Users see "same score, different rank" with no hint why. → Either show two decimals when there are visible ties, or surface the tie-break field via title=.
  • [NIT] cmd/server/rx_dashboard.go:266: leaderboardHexRes = 13 is a magic const documented as "~150 m at our latitude". hexSizeForRes(13) is zoom-dependent in display contexts; here it's only used for binning so it's fine, but a passing reader will assume parity with the coverage map's res. → Note explicitly that this res is decoupled from the map's render res and never converted back to a boundary.

Carmack (perf/correctness)

  • [MAJOR] (same as adversarial #1): moving the GROUP BY out of SQLite into Go is the right call for the per-cell rarity weighting, but the missing row cap is the failure mode. SQLite would have streamed the prior GROUP BY rx_pubkey via the covering index; the Go path is now O(rows) memory with three maps. Cap the result set or pre-aggregate distinct (rx_pubkey, cell, heard_key) triples in SQL first (still no rarity, but kills the duplicate-reception multiplier and the receptions counter loses meaning, so just cap rows).
  • [NIT] cmd/server/rx_dashboard.go:288-303 — each row allocates a string cell via fmt.Sprintf inside the inner loop (in hexCellAt). Hot path. Not a blocker, but if row counts climb, this is the first thing to profile. → Precompute int64 cell ids and use those as map keys; only stringify on egress.
  • [NIT] public/rx-coverage.js:108-117: sortBoard() runs on every renderBoard() even when nothing relevant changed (e.g. row click only flips selectedRx). Cheap today but the comparator does .toLowerCase() per call. → Sort only when sort state or data changes.
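A sketch of the "precompute int64 cell ids" suggestion from the hexCellAt NIT above. It assumes the cell can be expressed as two signed 32-bit grid coordinates (q, r); how hexCellAt actually derives its cell is not shown in this thread, so cellID, unpackCell, and cellLabel are hypothetical.

```go
package main

import "fmt"

// cellID packs two signed 32-bit grid coordinates into one int64 so the value
// can serve as a map key without allocating a string per row.
func cellID(q, r int32) int64 {
	return int64(q)<<32 | int64(uint32(r))
}

// unpackCell reverses cellID.
func unpackCell(id int64) (q, r int32) {
	return int32(id >> 32), int32(uint32(id))
}

// cellLabel stringifies a cell only on egress (for example, when building the
// JSON response), keeping fmt.Sprintf out of the per-row loop.
func cellLabel(id int64) string {
	q, r := unpackCell(id)
	return fmt.Sprintf("%d:%d", q, r)
}

func main() {
	observersPerCell := map[int64]int{}
	coords := [][2]int32{{3, -2}, {3, -2}, {-1, -1}}
	for _, c := range coords {
		observersPerCell[cellID(c[0], c[1])]++
	}
	for id, n := range observersPerCell {
		fmt.Println(cellLabel(id), n)
	}
}
```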

Kent Beck (TDD gate)

  • Red-then-green visible: NO. 997aaee0 adds both TestRxLeaderboardFrontierScore and the feature in a single commit. Per AGENTS.md TDD rules this is not a refactor, not a pure-docs change, and not a net-new UI surface — it changes ranking semantics on a tested function. The commit should have been split (red test → green impl). The test does pass on the impl and would fail when the impl is reverted to raw COUNT(*), so the test is real, but the discipline is missed. → Future ranking changes: separate red+green commits; the parent will block on this for non-community PRs.
  • Tests assert behavior or shape: behavior (exact ranking order, exact scores 0.5/2.5, cells counts). Good — not a tautology, not implementation-mirroring.
  • Coverage gap: no test for the score-dilution-by-blacklisted-observer case flagged above. Add a fixture with one blacklisted observer sharing a cell and assert the legitimate observer's score stays 1.0 (after the fix), not 0.5.

Munger (incentives/footguns)

  • [MAJOR] (mirrors adversarial #2): the incentive system you just shipped quietly rewards operators of blacklisted nodes with the power to depress everyone else's public scores by parking near them. Worst kind of footgun: silent, asymmetric, exploitable by exactly the actors you're filtering. Fix before next deploy.
  • [NIT] Frontier weighting rewards "first to a cell." A community member acting in good faith who joins late and drives well-covered routes appears uncompetitive forever — no decay, no per-window normalisation. → Document the social tradeoff in the score tooltip ("scores favour expanding coverage; don't be discouraged if your route is already well-covered"). Optional, but the alternative is a confused contributor.
  • [NIT] Untracking deploy-*.sh (eabbae61) + .gitignore entries assumes operators have an out-of-band copy. There's no docs/deploy.md reference left behind; if someone wipes their clone they'll be missing the deploy entrypoint with no breadcrumb. → Add a one-line deploy-*.sh.example template or a docs/DEPLOY.md pointer. Otherwise the next operator rediscovers the wheel.
  • The doc refresh (1574e087) accurately reflects the new aggregation path; no incentive to keep the now-redundant rx_pubkey index. Clean.

Verdict

  • BLOCKER count: 0
  • MAJOR count: 2 (unbounded scan; score dilution by blacklisted observers)
  • Merge-ready (this delta): no — the two MAJORs are both production-data behaviours that surface immediately, not theoretical edge cases. Fix both, add the dilution-regression test, then merge.
  • CI note: Playwright E2E test-issue-1630-reach-mobile-e2e.js failing identically on master since 2026-06-16; NOT a #1728 blocker.

…ew r2)

Two production-data MAJORs in the new frontier-weighted leaderboard:

- Unbounded scan: the Go-side rarity weighting dropped the SQL LIMIT, so an
  unauthenticated /api/rx-leaderboard streamed the whole window into maps.
  rxLeaderboard now takes a context.Context (QueryContext + batched ctx.Err()
  checks every 2048 rows), caps the scan at leaderboardScanCap (500k) ORDER BY
  rx_at DESC (keep most recent on truncation), and logs when the cap is hit.

- Score dilution: cellObservers was populated before the blacklist filter, so a
  blacklisted-node operator parked in a cell silently lowered every legitimate
  observer's 1/N frontier weight. The per-cell denominator (cellCount) now
  excludes observer- and node-blacklisted pubkeys; name-hidden (non-blacklisted)
  contributors still count.

Test: TestRxLeaderboardScoreNotDilutedByBlacklisted asserts a legit observer
sharing a cell with a blacklisted one scores 1.0, not 0.5 (fails without the
fix). Existing leaderboard/frontier tests updated for the new ctx parameter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@efiten

efiten commented Jun 18, 2026

Collaborator Author

Incremental review (frontier leaderboard) — both MAJORs fixed

Thanks. Both production-data MAJORs are fixed in cmd/server/rx_dashboard.go, with the requested regression test.

MAJOR 1 — unbounded leaderboard scan

rxLeaderboard now:

  • takes a context.Context and uses QueryContext, checking ctx.Err() in batches (every 2048 rows) so a client disconnect/timeout aborts the scan;
  • caps the scan with a hard SQL LIMIT leaderboardScanCap (500k), ORDER BY rx_at DESC so a truncated window keeps the most-recent receptions, and logs [rx-leaderboard] scan hit cap … when truncated.

handleRxLeaderboard passes r.Context().

MAJOR 2 — score dilution by blacklisted observers

The per-cell denominator is now computed from a cellCount map that excludes observer-blacklisted and node-blacklisted pubkeys, so an operator of a blacklisted node parked in a cell can no longer silently lower everyone else's frontier weight. Name-hidden (but not blacklisted) observers are legitimate contributors and still count.
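As a worked illustration of the fix, the sketch below computes a frontier score as the sum of 1/N over an observer's distinct cells, where N counts only non-blacklisted observers in that cell. The types and frontierScores are hypothetical and simplified: cells are plain strings, a single blacklisted callback stands in for both blacklist checks, and name-hidden handling is not modeled.

```go
package main

import (
	"fmt"
	"sort"
)

type reception struct {
	Observer string
	Cell     string
}

type entry struct {
	Observer   string
	Score      float64
	Cells      int
	Receptions int
}

func frontierScores(recs []reception, blacklisted func(string) bool) []entry {
	cells := map[string]map[string]bool{} // observer -> distinct cells
	receptions := map[string]int{}
	for _, r := range recs {
		if blacklisted(r.Observer) {
			continue // excluded from the board AND from every denominator
		}
		if cells[r.Observer] == nil {
			cells[r.Observer] = map[string]bool{}
		}
		cells[r.Observer][r.Cell] = true
		receptions[r.Observer]++
	}

	// cellCount: number of legitimate observers covering each cell.
	cellCount := map[string]int{}
	for _, cs := range cells {
		for c := range cs {
			cellCount[c]++
		}
	}

	out := make([]entry, 0, len(cells))
	for obs, cs := range cells {
		e := entry{Observer: obs, Cells: len(cs), Receptions: receptions[obs]}
		for c := range cs {
			if n := cellCount[c]; n > 0 {
				e.Score += 1 / float64(n)
			}
		}
		out = append(out, e)
	}

	// Score desc, then receptions desc, then pubkey asc.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		if out[i].Receptions != out[j].Receptions {
			return out[i].Receptions > out[j].Receptions
		}
		return out[i].Observer < out[j].Observer
	})
	return out
}

func main() {
	recs := []reception{{"good", "c1"}, {"bad", "c1"}, {"good", "c1"}}
	isBad := func(pk string) bool { return pk == "bad" }
	for _, e := range frontierScores(recs, isBad) {
		fmt.Printf("%s score=%.1f cells=%d receptions=%d\n", e.Observer, e.Score, e.Cells, e.Receptions)
	}
}
```

With "bad" treated as blacklisted, "good" is the only observer counted in c1 and keeps the full weight of 1.0; if the filter were applied after building the denominators, the same input would give "good" a score of 0.5.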

Test (kent-beck)

TestRxLeaderboardScoreNotDilutedByBlacklisted: a blacklisted observer sharing a legit observer's only cell — asserts the legit score stays 1.0 (it's 0.5 without the fix) and the blacklisted observer is absent. (Point taken on the red→green split for ranking changes; noted for future.)

NITs (not done — flagging for your call)

  • Dutch score/cells tooltips: left as-is — they're the author's intentional copy, not mine to change unilaterally; happy to translate to English for codebase consistency if you want.
  • Score .toFixed(1) tie display, hexCellAt fmt.Sprintf in the hot path (precompute int64 cell ids), sortBoard() on every render, and a deploy-*.sh.example/docs/DEPLOY.md breadcrumb — all low-risk; say the word and I'll fold any in.

CI status

Go Build & Test ✅ and every coverage/leaderboard e2e ✅. Playwright is red solely on test-issue-1630-reach-mobile-e2e.js:40 (pickRepeaterWithReach) — confirmed not a #1728 regression: the latest master run (27637454126) fails on the identical line, and that same master SHA (1476b857) passed on 06-15 then failed on 06-16, so it's an environmental/flaky failure independent of this branch. Docker is "skipped" only because it gates on the Playwright job. Nothing in this PR touches the reach-mobile repeater-selection path that test exercises.

Not merging — maintainer's call.

Re-spawn review:

@Kpa-clawbot

Owner

Round-3 review — both r2 MAJORs verified fixed

Verified f472a3f against the two r2 MAJORs on rxLeaderboard:

MAJOR 1 (unbounded scan) — FIXED. rxLeaderboard now takes context.Context, uses QueryContext, batches ctx.Err() checks every 2048 rows (correct — no per-row mutex), enforces LIMIT leaderboardScanCap (500k) with ORDER BY rx_at DESC (truncation keeps most-recent, which is the right pick for a coverage leaderboard), and logs when the cap fires. handleRxLeaderboard plumbs r.Context(). Clean.

MAJOR 2 (score dilution by blacklisted observers) — FIXED. Pre-pass builds cellCount excluding both IsObserverBlacklisted and IsBlacklisted pubkeys; the score loop reads from cellCount with a c > 0 guard (defensive — handles the "cell with only blacklisted observers" edge cleanly). Name-hidden-but-not-blacklisted observers correctly still count.

Test (kent-beck) — present and correctly behavior-asserting. TestRxLeaderboardScoreNotDilutedByBlacklisted is a real behavior gate: blacklisted "bad" co-located with legit "good", asserts good.Score == 1.0 (would be 0.5 without the fix) and bad absent. The 1e-9 tolerance is appropriate for the 1/N float arithmetic. Existing tests correctly threaded through with context.Background().

Findings

  • 0 BLOCKER
  • 0 MAJOR
  • NITs: none new beyond the ones efiten already flagged for maintainer call (Dutch tooltip copy, toFixed(1) tie display, hexCellAt Sprintf in hot path, sortBoard() on every render, deploy script breadcrumb). All non-blocking.

3-axis merge gate

  • ✅ MERGEABLE
  • ❌ mergeStateStatus = UNSTABLE: Playwright failing on test-issue-1630-reach-mobile-e2e.js:40 (pickRepeaterWithReach), which is the pre-existing master-side fixture time-bomb being addressed by #1747 (same author). Confirmed not a regression from this branch (no changes to reach-mobile repeater selection paths).
  • ✅ Review: 0 BLOCKER / 0 MAJOR
  • ✅ Tests present (regression test added in r2 commit)

Not auto-merging — strict gate requires CLEAN, not UNSTABLE. Recommend merging #1747 first to fix the fixture, then this PR's mergeStateStatus should flip to CLEAN on next CI run and become safely auto-mergeable. Maintainer's call.

@efiten

efiten commented Jun 18, 2026

Collaborator Author

Heads-up on CI: the Playwright E2E red here is not from this PR — it's the pre-existing test-issue-1630-reach-mobile-e2e.js failure that's been failing on master since ~2026-06-16 (the e2e fixture's observations aged out of the 30-day reach window).

Fixed separately in #1747 (a one-line tools/freshen-fixture.sh change, no app code). Once #1747 lands on master and this branch is brought up to date, this PR's Playwright will go green too. Deliberately kept out of this PR to keep its scope to the coverage feature.

Go Build & Test is green here; the Docker job only shows "skipped" because it gates on the Playwright job.
