Releases: vmkteam/topsrv

v0.1.3

19 May 21:05
80f450a

Changelog

  • 80f450a Fix normalizeURI bypass for binary $request (#17)

v0.1.2

19 May 16:58
fecf14f

Fixes

  • Hard-cap nginx URI labels at 240 bytes (UTF-8-safe trim) — stops label_value_too_long from the downstream Prometheus gateway.
  • Detect nginx-escaped binary (\xHH printable text from TLS handshakes on port 80) and route to /:invalid instead of inflating per-URI metric cardinality.
  • Collapse 32+ char base64url-like segments with uppercase to /:rest (session ids, magic links). All-lowercase slugs stay unchanged.
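
The 240-byte cap with a UTF-8-safe trim can be illustrated with a minimal sketch. The helper name trimLabel and the constant maxLabelBytes are hypothetical, not taken from the topsrv source; the sketch assumes the input is valid UTF-8 and a non-negative limit.

```go
package main

import "unicode/utf8"

// maxLabelBytes mirrors the 240-byte cap mentioned in the release note.
// The constant name is an assumption made for this sketch.
const maxLabelBytes = 240

// trimLabel (hypothetical helper) cuts s to at most max bytes and then
// backs up to the nearest rune boundary, so the result never ends in a
// partial multi-byte UTF-8 sequence.
func trimLabel(s string, max int) string {
	if len(s) <= max {
		return s
	}
	cut := max
	// s[cut] is the first byte that would be dropped. If it is a
	// continuation byte, the cut falls inside a rune, so step back.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}
```

Cutting on a rune boundary matters because a label that ends in half a code point is no longer valid UTF-8 and may be rejected downstream for a different reason than length.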

Operability

  • Auto-update keeps 2 previous binaries on disk instead of 5.

v0.1.1

15 May 15:53
f30a0ea

Changelog

v0.1.0

14 May 07:16
01bc23d

Highlights

The 0.1.0 line is the first minor-version bump and lands four major capabilities on top of the existing system/disk/network/postgres/nginx foundation:

  • A new bot-logs analytics agent that classifies nginx access-log traffic against 38 bot families and ships gzipped ndjson events to the topsrv.io ingest endpoint with disk-backed WAL retry.
  • Full Angie integration: JSON API status (per-zone, per-upstream, SSL, cache, rate-limit, slabs), ACME state machine, multi-host SSL exposure via SAN info-metric, and a fix for ssl-listen auto-discovery that was emitting unreachable http://host:443/ URLs.
  • Netstat listening ports with public/private/loopback scope classification, UDP coverage, and process attribution — operators can answer "what is exposed on this host?" at a glance.
  • A round of postgres reliability fixes (PG18 schema probing, slow app-name cache leak) and an update-supervisor fix that stops mistaking SIGTERM/manual restarts for crash loops.

✨ New features

Bot-logs agent (internal/topsrv/botlog, opt-in)

Ship UA-classified bot events from every tailed nginx/angie access log to https://push.topsrv.io/v1/bot-logs as gzipped ndjson batches. Disk-backed WAL spool retries transient send failures; permanent 4xx batches are dropped with reason-labelled metrics so alerts stay actionable.

  • UA classifier with 38 bot families and 100+ patterns covering Russian/CIS, Asian, AI 2026, SEO, social link-previews, and archive crawlers. TikTok folded under bytespider for FCrDNS consistency on the receiver side.
  • Auto-detected field aliases from log_format — works with combined, combined_plus, key-value, logfmt, and JSON (escape=json) formats out of the box. $realip_remote_addr, $http_referrer typo, ref JSON key all resolve correctly. Override per field via [BotLogs.FieldAliases] TOML.
  • Hardened pusher: per-pass transient-failure cap so a poison batch can't stall the spool; atomic spool writes (tmp+rename) + orphan sweep on startup; symlink/foreign-UID rejection on spool replay; 0o700/0o600 perms so a local writer can't forge ingest events under the agent's Bearer token; HTTP error bodies sanitized of Bearer tokens and JSON token fields before they reach logs.
  • Graceful shutdown drains the queue in parallel with HTTP idle-conn shutdown under a 15 s budget; final batches go straight to the WAL on timeout instead of retry-chaining past App.Shutdown.
  • Event payload: per-event host (request Host header, normalized via net.SplitHostPort with bracketed-IPv6 support + 256-byte cap) shipped alongside serverName (vhost config), so receiver-side joins pick the right one. Event.URI carries the un-normalized URL with query string (capped at URITruncate=2048, IE-legacy URL limit covering >99% of legitimate traffic).
  • Operational metrics: topsrv_botlog_events_total{state,reason}, match_total{family}, send_errors_total{kind}, batch_duration_seconds, queue_depth, spool_files, spool_bytes plus a startup topsrv_collector_config_warnings_total{kind} counter that surfaces misconfig (no $http_user_agent in any tailed format, missing/truncated ExtractFields, high-cardinality labels) post-hoc rather than only in stdout.
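
As an illustration of the per-event host normalization described above (net.SplitHostPort, bracketed-IPv6 support, 256-byte cap), here is a minimal sketch. The function name normalizeHost and the lowercasing step are assumptions for this example and are not confirmed by the release notes.

```go
package main

import (
	"net"
	"strings"
)

// maxHostBytes mirrors the 256-byte cap from the release note.
const maxHostBytes = 256

// normalizeHost (hypothetical helper) strips an optional port from a
// request Host header value, unwraps bracketed IPv6 literals, and caps
// the result at maxHostBytes.
func normalizeHost(raw string) string {
	host := strings.TrimSpace(raw)
	if h, _, err := net.SplitHostPort(host); err == nil {
		// "example.com:8080" -> "example.com", "[::1]:443" -> "::1"
		host = h
	} else if strings.HasPrefix(host, "[") && strings.HasSuffix(host, "]") {
		// Bracketed IPv6 without a port: "[::1]" -> "::1"
		host = host[1 : len(host)-1]
	}
	// Assumption: hosts are lowercased so receiver-side joins are
	// case-insensitive. The release note does not state this.
	host = strings.ToLower(host)
	if len(host) > maxHostBytes {
		// Byte-wise cut; host names are expected to be ASCII here.
		host = host[:maxHostBytes]
	}
	return host
}
```

A value without a port makes net.SplitHostPort return an error, which is why the sketch falls back to the raw value instead of treating the error as fatal.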

Angie integration (internal/topsrv/angie)

The Angie (nginx fork) JSON API gets a dedicated collector covering features that open-source nginx doesn't expose:

  • /status/ collector: topsrv_angie_up, per-zone request/response/SSL counters, per-upstream peer state (up/down/unavailable/recovering/busy) + per-peer health/keepalive, per-cache hit/stale/miss/expired/bypass, rate-limit (limit_conns/limit_reqs) results, shared-memory slab pages. Auto-discovery parses api /status/; and stub_status directives — API wins when both are present.
  • ACME collector (/status/http/acme_clients/?date=epoch): emits topsrv_angie_acme_state{name,state,certificate} and topsrv_angie_acme_next_run_seconds{name}. 404-silent on hosts without acme_client configured. Dashboards can alert on certificate!="valid", state="error", or stuck renewal.
  • SSL certificates: discovery scans acme_client directives and picks up /var/lib/angie/acme/<name>/certificate.pem for every client; rejects path-traversal in directive names. Files land in the existing topsrv_ssl_certificate_expiry_seconds metric — no new metric, no dashboard split. Multi-host (SAN) certs are no longer reduced to their CN: a new info metric topsrv_ssl_certificate_san_info{path,domain} enumerates every DNS name in CN ∪ SANs (dedup, CN first; convention follows blackbox_exporter's probe_ssl_last_chain_info).
  • Auto-discovery now prefers non-ssl listen: when a server block has both listen 80; and listen 443 ssl;, the api/stub_status URL is built against 80 (HTTP scrape against the SSL port returned HTTP 400 and broke both the api collector and the ACME collector that reuses the URL). When only an ssl listen exists, a warning is logged at startup so operators see the misconfig before scrape errors land.
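
The "CN ∪ SANs (dedup, CN first)" rule behind topsrv_ssl_certificate_san_info can be sketched as a small pure function. The name sanDomains is hypothetical; this is an illustration of the ordering and dedup rule, not the project's code.

```go
package main

// sanDomains (hypothetical helper) returns every DNS name a certificate
// covers: the common name first, then the SAN entries in their original
// order, with duplicates and empty names dropped.
func sanDomains(commonName string, dnsNames []string) []string {
	seen := make(map[string]struct{}, len(dnsNames)+1)
	out := make([]string, 0, len(dnsNames)+1)
	add := func(name string) {
		if name == "" {
			return
		}
		if _, dup := seen[name]; dup {
			return
		}
		seen[name] = struct{}{}
		out = append(out, name)
	}
	add(commonName)
	for _, name := range dnsNames {
		add(name)
	}
	return out
}
```

With a parsed crypto/x509 certificate, the inputs would be cert.Subject.CommonName and cert.DNSNames; each returned name would then become one {path,domain} series.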

Netstat listening ports + scope classification (internal/topsrv/netstat)

  • topsrv_netstat_listening_ports{port,proto,family,scope,process} — one series per active TCP LISTEN and bound UDP socket. UDP failures are non-fatal: a log line is emitted, TCP results still ship, so a kernel quirk on one protocol can't disable both.
  • Scope classification buckets the bind address into loopback / private (RFC1918, RFC4193 ULA, RFC6598 CGNAT, link-local) / public (routable or 0.0.0.0/:: wildcard — worst case).
  • Process attribution via a per-scrape PID cache; empty under kernel ACL (run agent as root for full visibility), with negative caching for missing PIDs.
  • Cardinality cap: 256 series per scrape so a compromised host with thousands of listeners cannot blow Prometheus budgets; single truncation warning when hit.
  • topsrv_netstat_tcp_connections gains a remote_scope label using the same classifier (LISTEN sockets carry remote_scope=none), so dashboards can alert on public-inbound (scan exposure) or private→public outbound (exfil signal) without per-IP cardinality.
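
The scope buckets listed above can be sketched with the standard net/netip package. The function name classifyScope, the string return values, and the choice to treat an unparseable address as public are assumptions made for this example.

```go
package main

import "net/netip"

// privateNets lists the ranges the release note places in the "private"
// bucket: RFC1918, RFC4193 ULA, and RFC6598 CGNAT. Link-local addresses
// are handled separately via IsLinkLocalUnicast.
var privateNets = []netip.Prefix{
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("172.16.0.0/12"),
	netip.MustParsePrefix("192.168.0.0/16"),
	netip.MustParsePrefix("fc00::/7"),
	netip.MustParsePrefix("100.64.0.0/10"),
}

// classifyScope (hypothetical helper) maps a bind address to
// "loopback", "private", or "public".
func classifyScope(bind string) string {
	addr, err := netip.ParseAddr(bind)
	if err != nil {
		// Assumption: an address we cannot parse is reported as the
		// worst case rather than silently hidden.
		return "public"
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	switch {
	case addr.IsLoopback():
		return "loopback"
	case addr.IsUnspecified():
		// 0.0.0.0 and :: listen on every interface: worst case.
		return "public"
	case addr.IsLinkLocalUnicast():
		return "private"
	}
	for _, p := range privateNets {
		if p.Contains(addr) {
			return "private"
		}
	}
	return "public"
}
```

Because the result is one of three fixed strings, the same function can feed both the scope label and the remote_scope label without adding per-IP cardinality.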

🛠️ Reliability fixes

PostgreSQL collector — PG18 + schema probing (internal/topsrv/postgres)

  • Replace hardcoded schema assumptions with to_regclass probes via relHasColumn. Fixes 42P01 («relation does not exist») when switchToLargestDB lands in a database where pg_stat_statements is not installed.
  • Skip removed columns: pg_stat_wal.wal_write_time / wal_sync_time removed in PG18 (moved into pg_stat_io); fixes 42703 on every scrape. collectStatWAL builds SELECT-list and Scan args dynamically so a single code path serves PG14..PG18.
  • Drop total_time fallback in pg_stat_statements: column was renamed in PG13 / extension 1.8, so falling back to it triggered 42703 once the extension was actually loaded. collectStatements early-returns when statementsTimeCol is empty so unsupported installs stop spamming errors each scrape.
  • Introduce versionPG18 + relPgStat* name constants so a typo can't silently disable feature detection. Verified on PG15/PG17/PG18 in orbstack: zero ERROR lines.
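
The idea of building the SELECT list dynamically so one code path serves several PostgreSQL versions can be sketched as follows. The function name statWALQuery is hypothetical, the column subset is chosen for illustration, and the value of versionPG18 assumes the server_version_num encoding (major × 10000); none of this is confirmed as the project's implementation.

```go
package main

import "strings"

// versionPG18 assumes the server_version_num encoding, where
// PostgreSQL 18.0 is reported as 180000.
const versionPG18 = 180000

// statWALQuery (hypothetical helper) returns a pg_stat_wal query and
// the ordered column list that was selected. The caller can use the
// column list to build matching Scan destinations, so the query and
// the scan targets cannot drift apart.
func statWALQuery(serverVersionNum int) (string, []string) {
	cols := []string{"wal_records", "wal_fpi", "wal_buffers_full"}
	if serverVersionNum < versionPG18 {
		// Per the release note, these timing columns were removed
		// from pg_stat_wal in PG18.
		cols = append(cols, "wal_write_time", "wal_sync_time")
	}
	return "SELECT " + strings.Join(cols, ", ") + " FROM pg_stat_wal", cols
}
```

Returning the column list alongside the SQL text keeps version-specific branching in one place instead of repeating it at the scan site.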

PostgreSQL — fix slow leak in app-name cache

appNames accumulated (queryid, application_name) pairs forever on every sample tick: a process restart with a new pid/uuid in application_name minted a fresh entry; RSS grew ~70 MB/day on busy hosts.

  • Store last-seen time per pair (map[int64]map[string]time.Time); add pruneAppNames to evict pairs older than appNamesTTL=1h and drop newly-empty queryid sub-maps.
  • Run pruning on a separate 5-minute ticker; full walk of a 250k-entry map is ~14 ms, well below the existing 2 s sample budget.

Update supervisor — skip non-crash restarts in rollback check

  • Add Graceful flag to updateState; mark it on ctx.Done() so SIGTERM / manual restarts inside the post-update window no longer bump RestartCount and trip a false crash-loop rollback.
  • Run markStableAfter goroutine that zeros RestartCount once the new binary has been alive for 60 s, capping the brittle early-life window in which restarts can accumulate.
  • Guard the defer with markGracefulIfCancelled so a panic in Run cannot mask itself as a clean shutdown.
  • Extract attemptRollback so the success path is unit-testable without os.Exit; clear LastUpdate/RestartCount/Graceful after a successful rollback to stop the supervisor from re-tripping rollback against the same binary.
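
The interaction between the Graceful flag, RestartCount, and the stability timer can be illustrated with a small state sketch. Only the field names RestartCount and Graceful come from the release note; the method names, the maxRestarts threshold, and the exact point at which the counter is bumped are assumptions for illustration.

```go
package main

// updateState is a trimmed-down, hypothetical version of the persisted
// supervisor state described in the release note.
type updateState struct {
	RestartCount int
	Graceful     bool
}

// markGraceful records that the process is exiting on purpose
// (for example, after its context was cancelled by SIGTERM).
func (s *updateState) markGraceful() { s.Graceful = true }

// onStart is called when the process starts inside the post-update
// window. A start that follows a graceful exit is not counted. It
// returns true when the restart count reaches maxRestarts, meaning a
// rollback should be attempted.
func (s *updateState) onStart(maxRestarts int) bool {
	if s.Graceful {
		s.Graceful = false // consume the marker; do not count this start
		return false
	}
	s.RestartCount++
	return s.RestartCount >= maxRestarts
}

// markStable zeros the counter once the new binary has stayed alive
// long enough (60 s in the release note).
func (s *updateState) markStable() { s.RestartCount = 0 }
```

The point of the sketch is that only unexplained restarts accumulate toward the rollback threshold, and the counter cannot carry over once the binary has proven stable.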

📦 Operator notes

  • Bot-logs is opt-in: set [BotLogs] Enabled = true and provide [Push].Endpoint. Without that, none of the new bot-log code runs and no extra log files are tailed.
  • The angie ssl-listen fix is silent for operators with already-working setups; only hosts where auto-discovery was previously picking the SSL port will see the metric come back online without config changes.
  • The new san_info metric adds one series per DNS name on each multi-host cert (typically 3–5 SANs per cert), which increases push-payload size slightly. Negligible at any realistic fleet size.
  • No metric removals, no Prometheus label removals. All existing dashboards continue to work.

v0.0.21

17 Apr 16:37
66b355e

What's Changed

  • PostgreSQL bloat estimation — new metrics topsrv_pg_table_bloat_size_bytes, topsrv_pg_table_bloat_pct, topsrv_pg_index_bloat_size_bytes, topsrv_pg_index_bloat_pct (top 50 per type), refreshed every 15 min.

    • Uses the canonical ioguix heuristic (pg_class + pg_stats arithmetic) — catalog-only, never reads heap pages. Safe on multi-TB databases. No pgstattuple, no shared_buffers eviction.
    • Scans run in parallel via a stampede barrier (pool has 3 connections, the two catalogs don't contend).
    • Cache is replaced unconditionally on success so metrics clear after pg_repack / REINDEX. On total failure the refresh timestamp is reset so the next scrape retries immediately.
    • Measured cost on a prod-like cluster: ~108 ms warm (tables) + ~260 ms warm (indexes). Absorbed once per 15 min instead of every scrape.
    • Gatesrv dashboards consume these in existing Tables and Indexes tabs (no new tab).
  • Integration test stability — pre-generate the self-signed cert on the host (scripts/gen-test-certs.sh) and mount it into nginx + angie containers read-only. Previously apk add openssl inside the Alpine image took ~90 s and raced the 23 s healthcheck window, leaving nginx + angie unhealthy and four integration tests flaky. The host cert is reused for 365 days and regenerated automatically when under 7 days remain.

Changelog

  • 66b355e Add bloat estimation, stabilize integration certs (#3)

v0.0.20

17 Apr 10:27
64d300d

What's Changed

  • Major PostgreSQL collector overhaul — split the 1300-line postgres.go into a dedicated internal/topsrv/postgres/ subpackage (8 files + tests). Adds metrics for:

    • Wait events (sampled from pg_stat_activity) — pganalyze/APM-style «why is it slow right now»
    • pg_stat_archiver (topsrv_pg_archiver_total{result} + last_timestamp_seconds)
    • pg_stat_wal on PG14+ (wal_records_total, wal_fpi_total, wal_buffers_full_total, wal_io_time_seconds_total)
    • Locks extended: granted label, topsrv_pg_blocked_backends, topsrv_pg_lock_wait_seconds_max
    • Replication stages: topsrv_pg_replication_lag_seconds{stage=write|flush|replay} + replication_sync_state
    • Indexes (top 50) — topsrv_pg_index_scans_total, topsrv_pg_index_size_bytes
    • Table maintenance: last_maintenance_timestamp_seconds{op=vacuum|analyze}, mod_since_analyze
    • Settings — curated GUCs normalised to bytes/seconds
    • Stats reset timestamps for detecting unexpected resets
    • Outlier-aware histogram for pg_stat_statements using min_exec_time/max_exec_time
    • application_name=topsrv on pool connections + auto-switch to the largest database for per-table views
    • Breaking: topsrv_pg_locks gains {granted}, topsrv_pg_replication_lag_seconds gains {stage}
  • Capture write-heavy queries in top-N selection: the pg_stat_statements union now covers 5 dimensions (time, calls, blks_read, blks_dirtied, wal_bytes). Previously, top-N skipped queries with few calls but heavy DML/WAL churn; these now show up in the Queries/Statements tabs. The meta push to gatesrv also widens to top-30.

  • Lazy PostgreSQL init + retry: NewCollector no longer opens connections at startup. The pool is created on first Collect() via ensureReady, so topsrv stays useful when PG is temporarily unreachable (boot-time ordering with systemd). topsrv_pg_up reports 0 until PG comes back; no restart needed.

    • pg_stat_activity sampler no longer permanently disables after a transient error — rate-limited logging, keeps retrying.
    • New [Postgres].Disabled = true config option skips PG monitoring even when auto-discovery finds a local process.
    • Better boundary handling in switchToLargestDB — new pool created before old one closes, so a failed reconnect no longer leaves a closed-pool reference.
  • Self-monitoring metrics — every registered collector is wrapped in an instrumentedCollector:

    • topsrv_collector_scrape_duration_seconds{collector} — spot slow collectors (alert >5s = monitoring is adding overhead)
    • topsrv_collector_scrape_panics_total{collector} — any non-zero rate = collector bug, page immediately
    • Panics in one collector no longer break the /metrics response
    • Also logs push: disabled when [Push].Endpoint is empty so operators know why metrics aren't leaving the box
  • Structured logging — migrated all Printf/Errorf calls to embedlog's slog-style Print(ctx, msg, key, val) / Error(ctx, msg, "error", err). Log output is now queryable key/value pairs — fits VM/ELK/Loki tagging workflows.
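
The self-monitoring wrapper described above (scrape duration plus panic isolation per collector) can be sketched without any metrics library. The names scrapeResult and instrumentedCollect are hypothetical; the real instrumentedCollector presumably wraps a Prometheus collector, which this sketch does not attempt to reproduce.

```go
package main

import (
	"fmt"
	"time"
)

// scrapeResult is a hypothetical record of one collector run. In a
// real exporter these fields would feed a duration histogram and a
// panic counter labelled by collector name.
type scrapeResult struct {
	Collector string
	Duration  time.Duration
	Panicked  bool
	Err       error
}

// instrumentedCollect runs collect, measures how long it took, and
// converts a panic into an error so that one broken collector cannot
// take down the whole scrape.
func instrumentedCollect(name string, collect func() error) (res scrapeResult) {
	start := time.Now()
	res.Collector = name
	defer func() {
		res.Duration = time.Since(start)
		if r := recover(); r != nil {
			res.Panicked = true
			res.Err = fmt.Errorf("collector %s panicked: %v", name, r)
		}
	}()
	res.Err = collect()
	return res
}
```

The named return value is what lets the deferred function fill in the result after a panic; with an unnamed return, the recovered state would be lost.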

Changelog

  • 64d300d Document self-monitoring metrics and lazy PG init (#2)
  • cd214de Add lazy PG init, sampler retry, scrape metrics
  • 60c6213 Add write-heavy dimensions to pg_stat_statements top-N selection
  • 7e54c33 Refactor postgres collector into subpackage, expand P0/P1 metrics

v0.0.19

16 Apr 17:34

Changelog

  • 6345510 Improve pg_stat_statements top queries and reduce nginx URI cardinality
  • 1013c56 Limit process groups to top 100 by CPU+RSS, filter kernel threads

v0.0.18

16 Apr 08:46

What's Changed

  • Reduce nginx URI metrics cardinality — filter bot scanners and normalize PHP filenames:
    • Non-printable URIs (TLS garbage, SSH probes) → /:invalid
    • Scanner probes (.env, .git, .aws, .ssh, .svn, .bak, .sql) → /:bot-scanners
    • PHP filenames → :file.php (/shell.php → /:file.php, /wiki/index.php → /wiki/:file.php)
    • ~50% cardinality reduction on public-facing sites
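
The three normalization rules above can be sketched as one function. This is a simplified illustration, not the project's normalizeURI: the name normalizeURISketch is hypothetical, scanner detection is a plain substring match, and query strings are not handled.

```go
package main

import (
	"path"
	"strings"
	"unicode"
	"unicode/utf8"
)

// scannerMarkers lists the probe fragments named in the release note.
var scannerMarkers = []string{".env", ".git", ".aws", ".ssh", ".svn", ".bak", ".sql"}

// normalizeURISketch maps a request path to a low-cardinality label.
func normalizeURISketch(uri string) string {
	// Rule 1: invalid UTF-8 or non-printable characters (TLS garbage,
	// SSH probes) collapse to a single bucket.
	if !utf8.ValidString(uri) {
		return "/:invalid"
	}
	for _, r := range uri {
		if !unicode.IsPrint(r) {
			return "/:invalid"
		}
	}
	// Rule 2: well-known scanner probes collapse to one bucket.
	// A substring match is an assumption; it is deliberately coarse.
	for _, m := range scannerMarkers {
		if strings.Contains(uri, m) {
			return "/:bot-scanners"
		}
	}
	// Rule 3: keep the directory, replace the PHP filename.
	if strings.HasSuffix(uri, ".php") {
		dir := path.Dir(uri)
		if dir == "/" || dir == "." {
			return "/:file.php"
		}
		return dir + "/:file.php"
	}
	return uri
}
```

The order of the checks matters: binary input is rejected before any string matching, so garbage bytes can never be mistaken for a scanner probe or a PHP path.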

v0.0.17

15 Apr 19:57

What's Changed

  • Network link speed metric — new topsrv_network_speed_bytes{interface} gauge reports physical NIC link speed in bytes/sec. Only physical NICs (detected via sysfs device symlink), virtual interfaces skipped. Enables utilization alerting: rate(topsrv_network_bytes_total[5m]) / topsrv_network_speed_bytes > 0.8
  • Export CGO_ENABLED=0 globally in Makefile
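
For the link-speed gauge above, the unit conversion can be sketched as follows. On Linux, /sys/class/net/<interface>/speed reports the link speed in Mbit/s and commonly returns -1 when the speed is unknown. The function name linkSpeedBytes is hypothetical, and whether topsrv reads this exact file is an assumption.

```go
package main

import (
	"strconv"
	"strings"
)

// linkSpeedBytes (hypothetical helper) converts the contents of a
// sysfs speed file (Mbit/s, possibly with a trailing newline) into
// bytes per second. The second return value is false when the speed
// is unknown or unparseable, in which case no gauge should be emitted.
func linkSpeedBytes(sysfsSpeed string) (float64, bool) {
	mbps, err := strconv.ParseInt(strings.TrimSpace(sysfsSpeed), 10, 64)
	if err != nil || mbps <= 0 {
		return 0, false
	}
	// 1 Mbit/s = 1,000,000 bits/s = 125,000 bytes/s.
	return float64(mbps) * 1e6 / 8, true
}
```

Reporting the gauge in bytes per second keeps it in the same unit as rate(topsrv_network_bytes_total[5m]), so the utilization ratio in the alert expression needs no extra scaling factor.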

v0.0.16

15 Apr 18:10

Full Changelog: v0.0.15...v0.0.16