Releases: vmkteam/topsrv
v0.1.3
v0.1.2
Fixes
- Hard-cap nginx URI labels at 240 bytes (UTF-8-safe trim); stops `label_value_too_long` from the downstream Prometheus gateway.
- Detect nginx-escaped binary (`\xHH` printable text from TLS handshakes on port 80) and route to `/:invalid` instead of inflating per-URI metric cardinality.
- Collapse 32+ char base64url-like segments with uppercase to `/:rest` (session ids, magic links). All-lowercase slugs stay unchanged.
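The UTF-8-safe trim from the first fix can be illustrated with a minimal Go sketch. The function name `trimLabel` is hypothetical and not taken from the topsrv source; the only detail borrowed from the notes is the 240-byte cap.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxLabelBytes mirrors the 240-byte cap mentioned in the release note.
const maxLabelBytes = 240

// trimLabel (hypothetical name) cuts s to at most limit bytes without
// splitting a multi-byte UTF-8 sequence.
func trimLabel(s string, limit int) string {
	if limit <= 0 {
		return ""
	}
	if len(s) <= limit {
		return s
	}
	cut := limit
	// Step back while the byte at the cut position is a continuation byte,
	// which would mean the cut lands inside a rune.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	uri := "/каталог/товары/очень-длинный-путь"
	trimmed := trimLabel(uri, 20)
	fmt.Println(len(trimmed), utf8.ValidString(trimmed))
	_ = maxLabelBytes // a real caller would pass this as the limit
}
```

The result may be shorter than the limit by up to three bytes, which is the price of never emitting a broken rune.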
Operability
- Auto-update keeps 2 previous binaries on disk instead of 5.
v0.1.1
v0.1.0
Highlights
The 0.1.0 line is the first minor-version bump and lands four major capabilities on top of the existing system/disk/network/postgres/nginx foundation:
- A new bot-logs analytics agent that classifies nginx access-log traffic against 38 bot families and ships gzipped ndjson events to the topsrv.io ingest endpoint with disk-backed WAL retry.
- Full Angie integration: JSON API status (per-zone, per-upstream, SSL, cache, rate-limit, slabs), ACME state machine, multi-host SSL exposure via SAN info-metric, and a fix for ssl-listen auto-discovery that was emitting unreachable `http://host:443/` URLs.
- Netstat listening ports with public/private/loopback scope classification, UDP coverage, and process attribution, so operators can answer "what is exposed on this host?" at a glance.
- A round of postgres reliability fixes (PG18 schema probing, slow app-name cache leak) and an update-supervisor fix that stops mistaking SIGTERM/manual restarts for crash loops.
✨ New features
Bot-logs agent (internal/topsrv/botlog, opt-in)
Ship UA-classified bot events from every tailed nginx/angie access log to https://push.topsrv.io/v1/bot-logs as gzipped ndjson batches. Disk-backed WAL spool retries transient send failures; permanent 4xx batches are dropped with reason-labelled metrics so alerts stay actionable.
- UA classifier with 38 bot families and 100+ patterns covering Russian/CIS, Asian, AI 2026, SEO, social link-previews, and archive crawlers. TikTok folded under bytespider for FCrDNS consistency on the receiver side.
- Auto-detected field aliases from `log_format`: works with `combined`, `combined_plus`, key-value, logfmt, and JSON (`escape=json`) formats out of the box. `$realip_remote_addr`, the `$http_referrer` typo, and the `ref` JSON key all resolve correctly. Override per field via `[BotLogs.FieldAliases]` TOML.
- Hardened pusher: per-pass transient-failure cap so a poison batch can't stall the spool; atomic spool writes (tmp+rename) + orphan sweep on startup; symlink/foreign-UID rejection on spool replay; 0o700/0o600 perms so a local writer can't forge ingest events under the agent's Bearer token; HTTP error bodies sanitized of Bearer tokens and JSON token fields before they reach logs.
- Graceful shutdown drains the queue in parallel with HTTP idle-conn shutdown under a 15 s budget; final batches go straight to the WAL on timeout instead of retry-chaining past `App.Shutdown`.
- Event payload: per-event `host` (request `Host` header, normalized via `net.SplitHostPort` with bracketed-IPv6 support + 256-byte cap) shipped alongside `serverName` (vhost config), so receiver-side joins pick the right one. `Event.URI` carries the un-normalized URL with query string (capped at `URITruncate=2048`, IE-legacy URL limit covering >99% of legitimate traffic).
- Operational metrics: `topsrv_botlog_events_total{state,reason}`, `match_total{family}`, `send_errors_total{kind}`, `batch_duration_seconds`, `queue_depth`, `spool_files`, `spool_bytes` plus a startup `topsrv_collector_config_warnings_total{kind}` counter that surfaces misconfig (no `$http_user_agent` in any tailed format, missing/truncated `ExtractFields`, high-cardinality labels) post-hoc rather than only in stdout.
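The `host` normalization described in the event-payload item could look roughly like the sketch below. It assumes only what the notes state (`net.SplitHostPort`, bracketed IPv6, a 256-byte cap); the function name `normalizeHost` and the exact fallback order are illustrative guesses.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// maxHostBytes mirrors the 256-byte cap from the release note.
const maxHostBytes = 256

// normalizeHost (hypothetical name) strips an optional port and IPv6
// brackets from a request Host header value, then caps the length.
func normalizeHost(raw string) string {
	h := strings.TrimSpace(raw)
	if h == "" {
		return ""
	}
	if host, _, err := net.SplitHostPort(h); err == nil {
		// "example.com:8080" and "[::1]:443" both land here.
		h = host
	} else if strings.HasPrefix(h, "[") && strings.HasSuffix(h, "]") {
		// Bracketed IPv6 literal without a port, e.g. "[::1]".
		h = h[1 : len(h)-1]
	}
	if len(h) > maxHostBytes {
		// Byte-wise cut; host names are expected to be ASCII here.
		h = h[:maxHostBytes]
	}
	return h
}

func main() {
	for _, in := range []string{"example.com:8080", "[2001:db8::1]:443", "[::1]", "example.com"} {
		fmt.Printf("%q -> %q\n", in, normalizeHost(in))
	}
}
```

A value without a port makes `net.SplitHostPort` return an error, so the sketch treats that error as "no port present" rather than as a failure.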
Angie integration (internal/topsrv/angie)
The angie (nginx fork) JSON API gets a dedicated collector covering features nginx-free doesn't expose:
- `/status/` collector: `topsrv_angie_up`, per-zone request/response/SSL counters, per-upstream peer state (up/down/unavailable/recovering/busy) + per-peer health/keepalive, per-cache hit/stale/miss/expired/bypass, rate-limit (`limit_conns`/`limit_reqs`) results, shared-memory slab pages. Auto-discovery parses `api /status/;` and `stub_status` directives; API wins when both are present.
- ACME collector (`/status/http/acme_clients/?date=epoch`): emits `topsrv_angie_acme_state{name,state,certificate}` and `topsrv_angie_acme_next_run_seconds{name}`. 404-silent on hosts without `acme_client` configured. Dashboards can alert on `certificate!="valid"`, `state="error"`, or stuck renewal.
- SSL certificates: discovery scans `acme_client` directives and picks up `/var/lib/angie/acme/<name>/certificate.pem` for every client; rejects path-traversal in directive names. Files land in the existing `topsrv_ssl_certificate_expiry_seconds` metric: no new metric, no dashboard split. Multi-host (SAN) certs are no longer reduced to their CN: a new info metric `topsrv_ssl_certificate_san_info{path,domain}` enumerates every DNS name in `CN ∪ SANs` (dedup, CN first; convention follows blackbox_exporter's `probe_ssl_last_chain_info`).
- Auto-discovery now prefers non-ssl listen: when a server block has both `listen 80;` and `listen 443 ssl;`, the api/stub_status URL is built against 80 (HTTP scrape against the SSL port returned HTTP 400 and broke both the api collector and the ACME collector that reuses the URL). When only an ssl listen exists, a warning is logged at startup so operators see the misconfig before scrape errors land.
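The `CN ∪ SANs` enumeration behind `topsrv_ssl_certificate_san_info` (dedup, CN first) can be sketched as follows. The helper names `certDomains` and `domainsFromCert` are hypothetical; only the standard-library fields `Subject.CommonName` and `DNSNames` of `x509.Certificate` are real.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// certDomains (hypothetical name) returns the CN followed by every SAN DNS
// name, skipping empty values and duplicates while preserving order.
func certDomains(cn string, sans []string) []string {
	seen := make(map[string]struct{}, len(sans)+1)
	out := make([]string, 0, len(sans)+1)
	add := func(name string) {
		if name == "" {
			return
		}
		if _, dup := seen[name]; dup {
			return
		}
		seen[name] = struct{}{}
		out = append(out, name)
	}
	add(cn) // CN first, per the convention named in the notes
	for _, s := range sans {
		add(s)
	}
	return out
}

// domainsFromCert adapts a parsed certificate to certDomains.
func domainsFromCert(cert *x509.Certificate) []string {
	return certDomains(cert.Subject.CommonName, cert.DNSNames)
}

func main() {
	names := certDomains("example.com", []string{"www.example.com", "example.com", "api.example.com"})
	fmt.Println(names)
}
```

Each returned name would map to one `{path,domain}` series, which is why a cert with N distinct names yields N info series.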
Netstat listening ports + scope classification (internal/topsrv/netstat)
- `topsrv_netstat_listening_ports{port,proto,family,scope,process}`: one series per active TCP LISTEN and bound UDP socket. UDP failures are non-fatal: a log line is emitted, TCP results still ship, so a kernel quirk on one protocol can't disable both.
- Scope classification buckets the bind address into `loopback` / `private` (RFC1918, RFC4193 ULA, RFC6598 CGNAT, link-local) / `public` (routable, or the `0.0.0.0`/`::` wildcard as the worst case).
- Process attribution via a per-scrape PID cache; empty under kernel ACL (run agent as root for full visibility), with negative caching for missing PIDs.
- Cardinality cap: 256 series per scrape so a compromised host with thousands of listeners cannot blow Prometheus budgets; single truncation warning when hit.
- `topsrv_netstat_tcp_connections` gains a `remote_scope` label using the same classifier (LISTEN sockets carry `remote_scope=none`), so dashboards can alert on public-inbound (scan exposure) or private→public outbound (exfil signal) without per-IP cardinality.
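The scope buckets listed above map closely onto Go's `net/netip` predicates. The sketch below is one possible classifier, not the topsrv implementation: the function name `scopeOf` is hypothetical, and treating an unparsable address as `public` is an assumption made to stay on the worst-case side.

```go
package main

import (
	"fmt"
	"net/netip"
)

// cgnat is the RFC6598 shared address space, which netip.Addr.IsPrivate
// does not cover on its own.
var cgnat = netip.MustParsePrefix("100.64.0.0/10")

// scopeOf (hypothetical name) buckets a bind address into
// loopback / private / public.
func scopeOf(bind string) string {
	addr, err := netip.ParseAddr(bind)
	if err != nil {
		return "public" // assumption: unknown input is treated as worst case
	}
	addr = addr.Unmap() // fold ::ffff:a.b.c.d into plain IPv4
	switch {
	case addr.IsLoopback():
		return "loopback"
	case addr.IsPrivate(), // RFC1918 and RFC4193 ULA
		addr.IsLinkLocalUnicast(),
		cgnat.Contains(addr):
		return "private"
	default:
		// Routable addresses and the 0.0.0.0 / :: wildcards end up here.
		return "public"
	}
}

func main() {
	for _, a := range []string{"127.0.0.1", "10.1.2.3", "100.64.0.1", "fe80::1", "0.0.0.0", "8.8.8.8"} {
		fmt.Printf("%-12s %s\n", a, scopeOf(a))
	}
}
```

Because the wildcard addresses fall through to the default branch, a socket bound to all interfaces is reported as `public` without any special casing.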
🛠️ Reliability fixes
PostgreSQL collector — PG18 + schema probing (internal/topsrv/postgres)
- Replace hardcoded schema assumptions with `to_regclass` probes via `relHasColumn`. Fixes `42P01` («relation does not exist») when `switchToLargestDB` lands in a database where `pg_stat_statements` is not installed.
- Skip removed columns: `pg_stat_wal.wal_write_time`/`wal_sync_time` removed in PG18 (moved into `pg_stat_io`); fixes `42703` on every scrape. `collectStatWAL` builds SELECT-list and Scan args dynamically so a single code path serves PG14..PG18.
- Drop `total_time` fallback in `pg_stat_statements`: column was renamed in PG13 / extension 1.8, so falling back to it triggered `42703` once the extension was actually loaded. `collectStatements` early-returns when `statementsTimeCol` is empty so unsupported installs stop spamming errors each scrape.
- Introduce `versionPG18` + `relPgStat*` name constants so a typo can't silently disable feature detection. Verified on PG15/PG17/PG18 in orbstack: zero ERROR lines.
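Building the SELECT list and the Scan arguments together, as described for `collectStatWAL`, might look like the sketch below. The struct `walStats` and the function `buildStatWALQuery` are hypothetical names; the column set is limited to the ones these notes mention, and the version check is a simplified stand-in for the real feature detection.

```go
package main

import (
	"fmt"
	"strings"
)

// walStats (hypothetical) holds the scanned values.
type walStats struct {
	Records, FPI, BuffersFull int64
	WriteTime, SyncTime       float64
}

// buildStatWALQuery returns the query text and matching Scan destinations
// for the given PostgreSQL major version. Columns and destinations are
// appended in lockstep so they can never drift apart.
func buildStatWALQuery(major int, dst *walStats) (string, []any) {
	cols := []string{"wal_records", "wal_fpi", "wal_buffers_full"}
	args := []any{&dst.Records, &dst.FPI, &dst.BuffersFull}
	if major < 18 {
		// wal_write_time / wal_sync_time are gone from pg_stat_wal in PG18.
		cols = append(cols, "wal_write_time", "wal_sync_time")
		args = append(args, &dst.WriteTime, &dst.SyncTime)
	}
	return "SELECT " + strings.Join(cols, ", ") + " FROM pg_stat_wal", args
}

func main() {
	var st walStats
	for _, v := range []int{17, 18} {
		q, args := buildStatWALQuery(v, &st)
		fmt.Printf("PG%d: %s (%d scan args)\n", v, q, len(args))
	}
}
```

With a `database/sql` row, the caller would pass the slice as `row.Scan(args...)`, so a skipped column never leaves a dangling destination.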
PostgreSQL — fix slow leak in app-name cache
appNames accumulated (queryid, application_name) pairs forever on every sample tick: a process restart with a new pid/uuid in application_name minted a fresh entry; RSS grew ~70 MB/day on busy hosts.
- Store last-seen time per pair (`map[int64]map[string]time.Time`); add `pruneAppNames` to evict pairs older than `appNamesTTL=1h` and drop newly-empty queryid sub-maps.
- Run pruning on a separate 5-minute ticker; full walk of a 250k-entry map is ~14 ms, well below the existing 2 s sample budget.
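The pruning step can be sketched directly from the map shape and TTL given above. The signature of `pruneAppNames` shown here (explicit `now` and `ttl` parameters, eviction count as the return value) is an assumption chosen to keep the sketch testable; `demoPrune` exists only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// appNamesTTL mirrors the 1h TTL from the release note.
const appNamesTTL = time.Hour

// pruneAppNames evicts (queryid, application_name) pairs not seen within
// ttl and drops queryid sub-maps that become empty. It returns the number
// of evicted pairs.
func pruneAppNames(appNames map[int64]map[string]time.Time, now time.Time, ttl time.Duration) int {
	evicted := 0
	for queryID, names := range appNames {
		for name, lastSeen := range names {
			if now.Sub(lastSeen) > ttl {
				delete(names, name) // deleting during range is safe in Go
				evicted++
			}
		}
		if len(names) == 0 {
			delete(appNames, queryID)
		}
	}
	return evicted
}

// demoPrune builds a small sample map and reports what is left after pruning.
func demoPrune() (remainingQueryIDs, evicted int) {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	m := map[int64]map[string]time.Time{
		1: {"app-a": now.Add(-2 * time.Hour), "app-b": now.Add(-10 * time.Minute)},
		2: {"app-c": now.Add(-3 * time.Hour)},
	}
	evicted = pruneAppNames(m, now, appNamesTTL)
	return len(m), evicted
}

func main() {
	remaining, evicted := demoPrune()
	fmt.Println("remaining queryids:", remaining, "evicted pairs:", evicted)
}
```

In the agent this would run from the separate 5-minute ticker, under whatever lock already protects the map.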
Update supervisor — skip non-crash restarts in rollback check
- Add `Graceful` flag to `updateState`; mark it on `ctx.Done()` so SIGTERM / manual restarts inside the post-update window no longer bump `RestartCount` and trip a false crash-loop rollback.
- Run `markStableAfter` goroutine that zeros `RestartCount` once the new binary has been alive for 60 s, capping the brittle early-life window in which restarts can accumulate.
- Guard the defer with `markGracefulIfCancelled` so a panic in `Run` cannot mask itself as a clean shutdown.
- Extract `attemptRollback` so the success path is unit-testable without `os.Exit`; clear `LastUpdate`/`RestartCount`/`Graceful` after a successful rollback to stop the supervisor from re-tripping rollback against the same binary.
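The interaction between `Graceful` and `RestartCount` can be illustrated with a simplified state machine. This is a sketch under stated assumptions, not the supervisor's code: `recordStart`, `markStable`, and the `maxRestarts` threshold are hypothetical, and the real `updateState` carries more fields (for example `LastUpdate`).

```go
package main

import (
	"context"
	"fmt"
)

// updateState keeps only the two fields this sketch needs.
type updateState struct {
	RestartCount int
	Graceful     bool
}

// markGracefulIfCancelled flags a clean shutdown only when the context was
// actually cancelled, so an exit caused by a panic still counts as a crash.
func (s *updateState) markGracefulIfCancelled(ctx context.Context) {
	if ctx.Err() != nil {
		s.Graceful = true
	}
}

// markStable zeros the counter once the new binary has proven itself.
func (s *updateState) markStable() { s.RestartCount = 0 }

// recordStart is called on each start inside the post-update window and
// reports whether a rollback should be attempted. maxRestarts is a
// hypothetical threshold.
func (s *updateState) recordStart(maxRestarts int) bool {
	if s.Graceful {
		// Previous exit was SIGTERM or a manual restart: not a crash.
		s.Graceful = false
		return false
	}
	s.RestartCount++
	return s.RestartCount >= maxRestarts
}

func main() {
	s := &updateState{}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate SIGTERM arriving during the post-update window
	s.markGracefulIfCancelled(ctx)
	fmt.Println("rollback after graceful restart:", s.recordStart(3))
}
```

Consuming the flag inside `recordStart` means one graceful exit excuses exactly one restart, so a later crash is still counted.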
📦 Operator notes
- Bot-logs is opt-in: set `[BotLogs] Enabled = true` and provide `[Push].Endpoint`. Without that, none of the new bot-log code runs and no extra log files are tailed.
- The angie ssl-listen fix is silent for operators with already-working setups; only hosts where auto-discovery was previously picking the SSL port will see the metric come back online without config changes.
- The new `san_info` metric increases push-payload size by ~N series per multi-host cert (typical: 3–5 SANs per cert). Negligible at any realistic fleet size.
- No metric removals, no Prometheus label removals. All existing dashboards continue to work.
v0.0.21
What's Changed
- PostgreSQL bloat estimation: new metrics `topsrv_pg_table_bloat_size_bytes`, `topsrv_pg_table_bloat_pct`, `topsrv_pg_index_bloat_size_bytes`, `topsrv_pg_index_bloat_pct` (top 50 per type), refreshed every 15 min.
  - Uses the canonical ioguix heuristic (`pg_class` + `pg_stats` arithmetic): catalog-only, never reads heap pages. Safe on multi-TB databases. No `pgstattuple`, no shared_buffers eviction.
  - Scans run in parallel via a stampede barrier (pool has 3 connections, the two catalogs don't contend).
  - Cache is replaced unconditionally on success so metrics clear after `pg_repack`/`REINDEX`. On total failure the refresh timestamp is reset so the next scrape retries immediately.
  - Measured cost on a prod-like cluster: ~108 ms warm (tables) + ~260 ms warm (indexes). Absorbed once per 15 min instead of every scrape.
  - Gatesrv dashboards consume these in existing Tables and Indexes tabs (no new tab).
- Integration test stability: pre-generate the self-signed cert on the host (`scripts/gen-test-certs.sh`) and mount it into nginx + angie containers read-only. Previously `apk add openssl` inside the Alpine image took ~90 s and raced the 23 s healthcheck window, leaving nginx + angie unhealthy and four integration tests flaky. The host cert is reused for 365 days and regenerated automatically when under 7 days remain.
v0.0.20
What's Changed
- Major PostgreSQL collector overhaul: split the 1300-line `postgres.go` into a dedicated `internal/topsrv/postgres/` subpackage (8 files + tests). Adds metrics for:
  - Wait events (sampled from `pg_stat_activity`): pganalyze/APM-style «why is it slow right now»
  - `pg_stat_archiver` (`topsrv_pg_archiver_total{result}` + `last_timestamp_seconds`)
  - `pg_stat_wal` on PG14+ (`wal_records_total`, `wal_fpi_total`, `wal_buffers_full_total`, `wal_io_time_seconds_total`)
  - Locks extended: `granted` label, `topsrv_pg_blocked_backends`, `topsrv_pg_lock_wait_seconds_max`
  - Replication stages: `topsrv_pg_replication_lag_seconds{stage=write|flush|replay}` + `replication_sync_state`
  - Indexes (top 50): `topsrv_pg_index_scans_total`, `topsrv_pg_index_size_bytes`
  - Table maintenance: `last_maintenance_timestamp_seconds{op=vacuum|analyze}`, `mod_since_analyze`
  - Settings: curated GUCs normalised to bytes/seconds
  - Stats reset timestamps for detecting unexpected resets
  - Outlier-aware histogram for pg_stat_statements using `min_exec_time`/`max_exec_time`
  - `application_name=topsrv` on pool connections + auto-switch to the largest database for per-table views
  - Breaking: `topsrv_pg_locks` gains `{granted}`, `topsrv_pg_replication_lag_seconds` gains `{stage}`
- Capture write-heavy queries in top-N selection: `pg_stat_statements` union now covers 5 dimensions (time, calls, blks_read, blks_dirtied, wal_bytes). Previously top-N skipped queries with few calls but lots of DML/WAL churn; now they show up in Queries/Statements tabs. Meta push to gatesrv also widens to top-30.
- Lazy PostgreSQL init + retry: `NewCollector` no longer opens connections at startup. The pool is created on first `Collect()` via `ensureReady`, so topsrv stays useful when PG is temporarily unreachable (boot-time ordering with systemd).
  - `topsrv_pg_up` reports `0` until PG comes back; no restart needed.
  - `pg_stat_activity` sampler no longer permanently disables after a transient error: rate-limited logging, keeps retrying.
  - New `[Postgres].Disabled = true` config option skips PG monitoring even when auto-discovery finds a local process.
  - Better boundary handling in `switchToLargestDB`: new pool created before old one closes, so a failed reconnect no longer leaves a closed-pool reference.
- Self-monitoring metrics: every registered collector is wrapped in an `instrumentedCollector`:
  - `topsrv_collector_scrape_duration_seconds{collector}`: spot slow collectors (alert `>5s` = monitoring is adding overhead)
  - `topsrv_collector_scrape_panics_total{collector}`: any non-zero rate = collector bug, page immediately
  - Panics in one collector no longer break the `/metrics` response
  - Also logs `push: disabled` when `[Push].Endpoint` is empty so operators know why metrics aren't leaving the box
- Structured logging: migrated all `Printf`/`Errorf` calls to embedlog's slog-style `Print(ctx, msg, key, val)` / `Error(ctx, msg, "error", err)`. Log output is now queryable key/value pairs, which fits VM/ELK/Loki tagging workflows.
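The `instrumentedCollector` idea from the self-monitoring item above (time every scrape, count panics, keep one broken collector from taking down the rest) can be sketched with the standard library alone. The type below is a stand-in: the real wrapper presumably implements the Prometheus collector interface and exports the two metrics named in the notes, whereas this version only keeps the values in fields.

```go
package main

import (
	"fmt"
	"time"
)

// instrumented wraps a collect function, recording how long the last
// scrape took and how many scrapes panicked. Field names are illustrative.
type instrumented struct {
	name         string
	collect      func()
	lastDuration time.Duration
	panics       int
}

// Collect runs the wrapped function and contains any panic it raises.
func (c *instrumented) Collect() {
	start := time.Now()
	defer func() {
		c.lastDuration = time.Since(start)
		if r := recover(); r != nil {
			// Count the panic instead of letting it unwind the scrape.
			c.panics++
		}
	}()
	c.collect()
}

func main() {
	collectors := []*instrumented{
		{name: "broken", collect: func() { panic("nil map write") }},
		{name: "healthy", collect: func() { time.Sleep(time.Millisecond) }},
	}
	for _, c := range collectors {
		c.Collect() // the healthy collector still runs after the broken one
		fmt.Printf("%s: panics=%d duration>0=%v\n", c.name, c.panics, c.lastDuration > 0)
	}
}
```

The `recover` call sits directly in the deferred function, which is the only place Go allows it to stop a panic.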
v0.0.19
v0.0.18
What's Changed
- Reduce nginx URI metrics cardinality by filtering bot scanners and normalizing PHP filenames:
  - Non-printable URIs (TLS garbage, SSH probes) → `/:invalid`
  - Scanner probes (`.env`, `.git`, `.aws`, `.ssh`, `.svn`, `.bak`, `.sql`) → `/:bot-scanners`
  - PHP filenames → `:file.php` (`/shell.php` → `/:file.php`, `/wiki/index.php` → `/wiki/:file.php`)
  - ~50% cardinality reduction on public-facing sites
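The three rewrite rules above can be expressed as a small normalizer. This is a sketch, not the shipped code: the function name `normalizeURI` is hypothetical, and matching scanner probes by substring is an assumption, since the notes list the markers but not how they are matched.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// scannerMarkers are the probe fragments listed in the release note.
var scannerMarkers = []string{".env", ".git", ".aws", ".ssh", ".svn", ".bak", ".sql"}

// normalizeURI (hypothetical name) maps a request path to a low-cardinality
// label value.
func normalizeURI(uri string) string {
	for _, r := range uri {
		// Invalid UTF-8 decodes to RuneError; control bytes are not printable.
		if r == utf8.RuneError || !unicode.IsPrint(r) {
			return "/:invalid"
		}
	}
	lower := strings.ToLower(uri)
	for _, m := range scannerMarkers {
		if strings.Contains(lower, m) { // assumption: substring match
			return "/:bot-scanners"
		}
	}
	if strings.HasSuffix(lower, ".php") {
		// Keep the directory part, replace only the file name.
		dir := uri[:strings.LastIndex(uri, "/")+1]
		return dir + ":file.php"
	}
	return uri
}

func main() {
	for _, u := range []string{"/shell.php", "/wiki/index.php", "/.env", "\x16\x03\x01", "/about"} {
		fmt.Printf("%q -> %q\n", u, normalizeURI(u))
	}
}
```

Rule order matters: the printable check runs first so binary garbage never reaches the string matching, and the scanner check runs before the PHP rule so a path containing one of the listed markers is bucketed as a probe.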
v0.0.17
What's Changed
- Network link speed metric: new `topsrv_network_speed_bytes{interface}` gauge reports physical NIC link speed in bytes/sec. Only physical NICs (detected via sysfs device symlink) are reported; virtual interfaces are skipped. Enables utilization alerting: `rate(topsrv_network_bytes_total[5m]) / topsrv_network_speed_bytes > 0.8`
- Export `CGO_ENABLED=0` globally in Makefile
v0.0.16
Full Changelog: v0.0.15...v0.0.16