Bulletproof pre-publish polish: SDK fixes, server hardening, /metrics, integration tests#1
Merged
Merged
Conversation
Client SDK fixes: - stripQueryParams now strips URL fragment (#hash) in addition to ?query, so hash-route state can't leak into stored paths - auto-track listens for hashchange so hash-routed SPAs emit pageviews - TTFB now also observed via PerformanceObserver navigation entries; the synchronous read at init returned 0 if the nav entry wasn't ready yet - Analytics fires a fresh pageview on bfcache restore (pageshow.persisted), so we no longer emit a pageleave with no matching pageview - performance.ts: hoist cleanup ahead of flush to avoid TDZ-fragile capture Package hygiene for npm: - sideEffects: false for tree-shaking - engines.node >=18 - publishConfig.access: public - move README/LICENSE copy from prepublishOnly to prepack so npm pack also reproduces a complete tarball - include README.md and LICENSE in the files allowlist alongside dist/ Docs: - README: operator hardening checklist (ADMIN_TOKEN, ALLOWED_SITES, rate limit at proxy, x-country trust, access log retention) - README: privacy notes warning consumers about path-segment PII and arbitrary track() props - env examples: revert ALLOWED_SITES to commented-out default, matching the public-OSS plug-and-play story
…isory Add per-field validation at the ingest boundary so the 16KB body limit can't be spent stuffing one giant value into an indexed column, and so malformed metric floats can't poison stats queries: - site_id: 1..=64 chars - path: 1..=2048 chars - referrer: <=253 chars - event name: 1..=64 chars - performance metrics: NaN / +-Infinity dropped (Postgres percentile_cont panics on NaN, and the values are useless either way) - x-country: require ASCII alpha, length 2 (previously any 2-byte unicode would slip through and be uppercased) Cover the new logic with unit tests in `types::tests`. Also add `server/.cargo/audit.toml` ignoring RUSTSEC-2023-0071. The `rsa` Marvin-attack advisory reaches us only via `sqlx-mysql`, which we disable at compile time (`sqlx` is built with `default-features = false` and only the `postgres` driver). `cargo audit` reads `Cargo.lock` and flags the unused crate anyway. The note documents the rationale and flags revisiting once `rsa >= 0.10` ships.
…shutdown Defense-in-depth pass. Everything ships on by default with sensible limits; operators can dial it down via env. Per-IP rate limiting (tower_governor with SmartIpKeyExtractor): - /collect: 60 burst, ~120 req/min steady — generous because SPAs fire pageleave + pageview close together on every route change - /contact: 3 burst, ~5 req/min — tight, this endpoint sends real email through Resend and is the obvious abuse target - Reads X-Forwarded-For / X-Real-IP first, falls back to TCP peer via ConnectInfo<SocketAddr>. Make sure a reverse proxy sets one of those. CORS: - /collect keeps Any origin (browsers POST from every embedding site) - /stats/* honors STATS_ORIGINS env (comma-separated list). Defaults to Any to avoid breaking existing setups, but the README now nudges operators to lock it down to their dashboard origin. Security response headers (always on): - X-Content-Type-Options: nosniff - Referrer-Policy: no-referrer - X-Frame-Options: DENY HSTS gated on BEHIND_TLS=1 so it doesn't ship on plain-HTTP dev hosts. Reliability: - 15s request timeout via TimeoutLayer — bounds tail latency on a slow Postgres so the pool can't get pinned forever. - Graceful shutdown on SIGTERM/SIGINT via with_graceful_shutdown so in-flight inserts get a chance to drain instead of being killed mid- query during a deploy. Config: - New env vars STATS_ORIGINS, BEHIND_TLS documented in README and deploy/pagetally.env.example. Tests: cargo test 12 pass, clippy --all-targets -D warnings clean, cargo audit clean (modulo the previously-documented inert RUSTSEC-2023-0071).
Integration tests (server/tests/api.rs, gated on DATABASE_URL via the sqlx::test macro that builds a fresh DB per test from migrations): - /health responds 200 ok - /collect inserts the expected row for a pageview - /collect rejects an unknown site_id when ALLOWED_SITES is set (403) - /collect rejects an oversize path (400) - /collect rejects a request that blows the 16KB body cap - /stats/* gives 401 without and with the wrong bearer token - /stats/summary returns rows once events land for the requested site - pageleave dur is server-clamped to the 30-minute ceiling - security headers (CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy) and x-request-id are emitted on every response - when the caller already sets x-request-id, we propagate it instead of rolling a new uuid (request-correlation across reverse proxies) Middleware additions in lib.rs: - Content-Security-Policy: default-src 'none'; frame-ancestors 'none'. Every response is JSON or plain text — nothing we serve should ever load subresources or execute script, and 'frame-ancestors' restates the X-Frame-Options posture for browsers that drop the legacy header. - SetRequestIdLayer + PropagateRequestIdLayer using MakeRequestUuid so every request has a stable id in logs and the same id is echoed back to the caller. - TraceLayer customized with explicit make_span / on_response at INFO so structured access logs are predictable instead of relying on the defaults silently changing across tower-http versions. Cargo: tower-http gains the request-id and util features. tower gets its util feature for ServiceExt::oneshot in tests. Adds http-body-util as a dev-dep for body-collecting in tests. Verification: cargo test now runs 12 unit + 12 integration tests (24 total) green against a real Postgres. clippy --all-targets -D warnings clean. CI already provisions Postgres so this lights up without further config.
Logging:
- LOG_FORMAT=json switches the tracing subscriber from pretty text to
one-line JSON records, the form most log shippers (Loki, Vector,
Datadog, CloudWatch) ingest directly. Each line carries a timestamp,
level, target, message, and any structured fields (including the
per-request span id). Default stays text so a `cargo run` is still
human-readable.
Load testing (scripts/loadtest.sh, requires oha):
- collect-burst sustained POST /collect from one IP — confirms the
rate-limit kicks in (lots of 429s) without buckling
- collect-spread parallel runs from distinct X-Forwarded-For values
to exercise the real ingest path
- stats-read hammer /stats/summary, optionally with ADMIN_TOKEN
Reference numbers from a release build against local Postgres are
captured in the README so we can tell when a change regresses
throughput by an order of magnitude. They are smoke-test floors, not
throughput guarantees.
README: new env vars (LOG_FORMAT, RUST_LOG) added to the configuration
table and a Load testing section pointing at the script.
axum-prometheus drops a tower layer onto every route, recording the
standard HTTP histogram (request count, latency, status, route label)
into a process-global recorder, and serves the textfile-format snapshot
on GET /metrics. Operators get latency / error rate / per-route traffic
without bolting on anything else.
Two router constructors so tests stay parallel-safe:
- router(state) — no metrics, used by all integration tests
- router_with_metrics(state) — installs the Prometheus recorder and the
/metrics route. Only main.rs calls it,
and exactly once per process.
The recorder is global (metrics-rs design); calling pair() twice would
panic. The split keeps test isolation clean without resorting to a
OnceLock dance.
/metrics is intentionally unauthenticated — Prometheus convention is to
expose it on an internal interface and rely on the network boundary.
The README spells this out so an operator who forgets to firewall it
isn't surprised.
Tests: 13 integration tests now (added one that asserts /metrics
renders the expected Prometheus textfile).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Final pass before publishing the
pagetallynpm package. SDK gets correctness fixes; server gets the middleware stack a public-internet OSS deploy actually needs; integration tests run against a real Postgres in CI.Client SDK (
client/)stripQueryParamsnow strips#hashin addition to?queryhashchangeso hash-routed SPAs emit pageviewsnavigationPerformanceObserver (the synchronous read at init returned 0 if the entry wasn't ready yet)pageshow.persisted) now fires a fresh pageview, so we no longer emit a pageleave with no matching pageviewsideEffects: false,engines.node >=18,publishConfig.access: public, README/LICENSE moved intoprepacksonpm packreproduces the published tarballServer hardening (
server/)tower_governor+SmartIpKeyExtractor—/collect60 burst / ~120 rpm,/contact3 burst / ~5 rpmSTATS_ORIGINSenv scopes CORS for/stats/*;/collectkeepsAnydefault-src 'none'; frame-ancestors 'none', X-Content-Type-Options, Referrer-Policy, X-Frame-Options. HSTS gated onBEHIND_TLS=1TimeoutLayer; graceful shutdown on SIGTERM/SIGINT drains in-flight inserts/collect(site_id ≤64, path ≤2048, referrer ≤253, event name ≤64), NaN/Inf scrubbed from performance metrics,x-countryconstrained to ASCII alphax-request-idinjected if absent, propagated if the caller sends oneLOG_FORMAT=jsonflips tracing-subscriber to one-line JSON for log shippersGET /metrics(axum-prometheus) for Prometheus scraping; unauthenticated by convention, README spells out the firewall expectationTests
#[sqlx::test]against a fresh Postgres per test — covers/health,/collecthappy path + rejection paths + body-limit,/stats/*with and without admin token, response-header surface, request-id propagation,/metricstext format, pageleave clampingObservability
LOG_FORMAT=json)/metricsPrometheus endpointscripts/loadtest.shwrapsohafor quickcollect-burst,collect-spread,stats-readsmoke testsReference numbers (release build, local Postgres)
/collectsingle-IP burst/stats/summaryreadsTreat as floors, not guarantees.
Test plan
cargo test— 12 unit + 13 integration greencargo clippy --all-targets -- -D warningscleancargo auditclean (RUSTSEC-2023-0071 documented inert viaserver/.cargo/audit.toml)npm test— 46/46npm pack --dry-run— 13.8 KB, 9 files/health,/metrics,/collectfrom SDK all verified