Skip to content

Bulletproof pre-publish polish: SDK fixes, server hardening, /metrics, integration tests#1

Merged
anfocic merged 6 commits into
masterfrom
chore/oss-publish-polish
May 9, 2026
Merged

Bulletproof pre-publish polish: SDK fixes, server hardening, /metrics, integration tests#1
anfocic merged 6 commits into
masterfrom
chore/oss-publish-polish

Conversation

@anfocic
Copy link
Copy Markdown
Owner

@anfocic anfocic commented May 8, 2026

Summary

Final pass before publishing the pagetally npm package. SDK gets correctness fixes; server gets the middleware stack a public-internet OSS deploy actually needs; integration tests run against a real Postgres in CI.

Client SDK (client/)

  • stripQueryParams now strips #hash in addition to ?query
  • auto-track listens for hashchange so hash-routed SPAs emit pageviews
  • TTFB is now also captured via a navigation PerformanceObserver (the synchronous read at init returned 0 if the entry wasn't ready yet)
  • bfcache restore (pageshow.persisted) now fires a fresh pageview, so we no longer emit a pageleave with no matching pageview
  • npm hygiene: sideEffects: false, engines.node >=18, publishConfig.access: public, README/LICENSE moved into prepack so npm pack reproduces the published tarball

Server hardening (server/)

  • Per-IP rate limiting via tower_governor + SmartIpKeyExtractor/collect 60 burst / ~120 rpm, /contact 3 burst / ~5 rpm
  • STATS_ORIGINS env scopes CORS for /stats/*; /collect keeps Any
  • Security headers always on: CSP default-src 'none'; frame-ancestors 'none', X-Content-Type-Options, Referrer-Policy, X-Frame-Options. HSTS gated on BEHIND_TLS=1
  • 15s TimeoutLayer; graceful shutdown on SIGTERM/SIGINT drains in-flight inserts
  • Per-field input validation on /collect (site_id ≤64, path ≤2048, referrer ≤253, event name ≤64), NaN/Inf scrubbed from performance metrics, x-country constrained to ASCII alpha
  • x-request-id injected if absent, propagated if the caller sends one
  • LOG_FORMAT=json flips tracing-subscriber to one-line JSON for log shippers
  • GET /metrics (axum-prometheus) for Prometheus scraping; unauthenticated by convention, README spells out the firewall expectation

Tests

  • 12 unit tests (validation, email)
  • 13 integration tests via #[sqlx::test] against a fresh Postgres per test — covers /health, /collect happy path + rejection paths + body-limit, /stats/* with and without admin token, response-header surface, request-id propagation, /metrics text format, pageleave clamping
  • 46 client vitest tests pass

Observability

  • Structured JSON logs (LOG_FORMAT=json)
  • /metrics Prometheus endpoint
  • scripts/loadtest.sh wraps oha for quick collect-burst, collect-spread, stats-read smoke tests

Reference numbers (release build, local Postgres)

Scenario Throughput p99
/collect single-IP burst ~71k rps <3 ms
/stats/summary reads ~20k rps ~5 ms

Treat as floors, not guarantees.

Test plan

  • cargo test — 12 unit + 13 integration green
  • cargo clippy --all-targets -- -D warnings clean
  • cargo audit clean (RUSTSEC-2023-0071 documented inert via server/.cargo/audit.toml)
  • npm test — 46/46
  • npm pack --dry-run — 13.8 KB, 9 files
  • Local end-to-end: server boots, /health, /metrics, /collect from SDK all verified
  • CI green on this branch

anfocic added 6 commits May 8, 2026 12:37
Client SDK fixes:
- stripQueryParams now strips URL fragment (#hash) in addition to ?query, so hash-route state can't leak into stored paths
- auto-track listens for hashchange so hash-routed SPAs emit pageviews
- TTFB now also observed via PerformanceObserver navigation entries; the synchronous read at init returned 0 if the nav entry wasn't ready yet
- Analytics fires a fresh pageview on bfcache restore (pageshow.persisted), so we no longer emit a pageleave with no matching pageview
- performance.ts: hoist cleanup ahead of flush to avoid TDZ-fragile capture

Package hygiene for npm:
- sideEffects: false for tree-shaking
- engines.node >=18
- publishConfig.access: public
- move README/LICENSE copy from prepublishOnly to prepack so npm pack also reproduces a complete tarball
- include README.md and LICENSE in the files allowlist alongside dist/

Docs:
- README: operator hardening checklist (ADMIN_TOKEN, ALLOWED_SITES, rate limit at proxy, x-country trust, access log retention)
- README: privacy notes warning consumers about path-segment PII and arbitrary track() props
- env examples: revert ALLOWED_SITES to commented-out default, matching the public-OSS plug-and-play story
…isory

Add per-field validation at the ingest boundary so the 16KB body limit
can't be spent stuffing one giant value into an indexed column, and so
malformed metric floats can't poison stats queries:

- site_id: 1..=64 chars
- path: 1..=2048 chars
- referrer: <=253 chars
- event name: 1..=64 chars
- performance metrics: NaN / +-Infinity dropped (Postgres percentile_cont
  panics on NaN, and the values are useless either way)
- x-country: require ASCII alpha, length 2 (previously any 2-byte
  unicode would slip through and be uppercased)

Cover the new logic with unit tests in `types::tests`.

Also add `server/.cargo/audit.toml` ignoring RUSTSEC-2023-0071. The
`rsa` Marvin-attack advisory reaches us only via `sqlx-mysql`, which we
disable at compile time (`sqlx` is built with `default-features = false`
and only the `postgres` driver). `cargo audit` reads `Cargo.lock` and
flags the unused crate anyway. The note documents the rationale and
flags revisiting once `rsa >= 0.10` ships.
…shutdown

Defense-in-depth pass. Everything ships on by default with sensible
limits; operators can dial it down via env.

Per-IP rate limiting (tower_governor with SmartIpKeyExtractor):
- /collect: 60 burst, ~120 req/min steady — generous because SPAs fire
  pageleave + pageview close together on every route change
- /contact: 3 burst, ~5 req/min — tight, this endpoint sends real email
  through Resend and is the obvious abuse target
- Reads X-Forwarded-For / X-Real-IP first, falls back to TCP peer via
  ConnectInfo<SocketAddr>. Make sure a reverse proxy sets one of those.

CORS:
- /collect keeps Any origin (browsers POST from every embedding site)
- /stats/* honors STATS_ORIGINS env (comma-separated list). Defaults
  to Any to avoid breaking existing setups, but the README now nudges
  operators to lock it down to their dashboard origin.

Security response headers (always on):
- X-Content-Type-Options: nosniff
- Referrer-Policy: no-referrer
- X-Frame-Options: DENY
HSTS gated on BEHIND_TLS=1 so it doesn't ship on plain-HTTP dev hosts.

Reliability:
- 15s request timeout via TimeoutLayer — bounds tail latency on a slow
  Postgres so the pool can't get pinned forever.
- Graceful shutdown on SIGTERM/SIGINT via with_graceful_shutdown so
  in-flight inserts get a chance to drain instead of being killed mid-
  query during a deploy.

Config:
- New env vars STATS_ORIGINS, BEHIND_TLS documented in README and
  deploy/pagetally.env.example.

Tests: cargo test 12 pass, clippy --all-targets -D warnings clean,
cargo audit clean (modulo the previously-documented inert
RUSTSEC-2023-0071).
Integration tests (server/tests/api.rs, gated on DATABASE_URL via the
sqlx::test macro that builds a fresh DB per test from migrations):

- /health responds 200 ok
- /collect inserts the expected row for a pageview
- /collect rejects an unknown site_id when ALLOWED_SITES is set (403)
- /collect rejects an oversize path (400)
- /collect rejects a request that blows the 16KB body cap
- /stats/* gives 401 without and with the wrong bearer token
- /stats/summary returns rows once events land for the requested site
- pageleave dur is server-clamped to the 30-minute ceiling
- security headers (CSP, X-Frame-Options, X-Content-Type-Options,
  Referrer-Policy) and x-request-id are emitted on every response
- when the caller already sets x-request-id, we propagate it instead of
  rolling a new uuid (request-correlation across reverse proxies)

Middleware additions in lib.rs:

- Content-Security-Policy: default-src 'none'; frame-ancestors 'none'.
  Every response is JSON or plain text — nothing we serve should ever
  load subresources or execute script, and 'frame-ancestors' restates
  the X-Frame-Options posture for browsers that drop the legacy header.
- SetRequestIdLayer + PropagateRequestIdLayer using MakeRequestUuid so
  every request has a stable id in logs and the same id is echoed back
  to the caller.
- TraceLayer customized with explicit make_span / on_response at INFO
  so structured access logs are predictable instead of relying on the
  defaults silently changing across tower-http versions.

Cargo: tower-http gains the request-id and util features. tower gets
its util feature for ServiceExt::oneshot in tests. Adds
http-body-util as a dev-dep for body-collecting in tests.

Verification: cargo test now runs 12 unit + 12 integration tests
(24 total) green against a real Postgres. clippy --all-targets
-D warnings clean. CI already provisions Postgres so this lights up
without further config.
Logging:
- LOG_FORMAT=json switches the tracing subscriber from pretty text to
  one-line JSON records, the form most log shippers (Loki, Vector,
  Datadog, CloudWatch) ingest directly. Each line carries a timestamp,
  level, target, message, and any structured fields (including the
  per-request span id). Default stays text so a `cargo run` is still
  human-readable.

Load testing (scripts/loadtest.sh, requires oha):
- collect-burst    sustained POST /collect from one IP — confirms the
                   rate-limit kicks in (lots of 429s) without buckling
- collect-spread   parallel runs from distinct X-Forwarded-For values
                   to exercise the real ingest path
- stats-read       hammer /stats/summary, optionally with ADMIN_TOKEN

Reference numbers from a release build against local Postgres are
captured in the README so we can tell when a change regresses
throughput by an order of magnitude. They are smoke-test floors, not
throughput guarantees.

README: new env vars (LOG_FORMAT, RUST_LOG) added to the configuration
table and a Load testing section pointing at the script.
axum-prometheus drops a tower layer onto every route, recording the
standard HTTP histogram (request count, latency, status, route label)
into a process-global recorder, and serves the textfile-format snapshot
on GET /metrics. Operators get latency / error rate / per-route traffic
without bolting on anything else.

Two router constructors so tests stay parallel-safe:

- router(state)              — no metrics, used by all integration tests
- router_with_metrics(state) — installs the Prometheus recorder and the
                               /metrics route. Only main.rs calls it,
                               and exactly once per process.

The recorder is global (metrics-rs design); calling pair() twice would
panic. The split keeps test isolation clean without resorting to a
OnceLock dance.

/metrics is intentionally unauthenticated — Prometheus convention is to
expose it on an internal interface and rely on the network boundary.
The README spells this out so an operator who forgets to firewall it
isn't surprised.

Tests: 13 integration tests now (added one that asserts /metrics
renders the expected Prometheus textfile).
@anfocic anfocic merged commit 18b4e8d into master May 9, 2026
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant