37 changes: 13 additions & 24 deletions docs/metrics.md
@@ -1,5 +1,7 @@
# Metrics reference

For example PromQL queries and alert recipes, see [promql-recipes.md](promql-recipes.md).

## System (`topsrv_cpu_*`, `topsrv_memory_*`, `topsrv_load_*`, `topsrv_swap_*`)

| Metric | Type | Labels | Description |
@@ -57,30 +59,6 @@
| `topsrv_netstat_udp_sndbuf_errors_total` | counter | — | UDP send buffer errors |
| `topsrv_netstat_ip_unknown_protos_total` | counter | — | IP datagrams with unknown protocol |

Useful queries:

```promql
# Inventory of publicly-reachable listeners across the fleet
topsrv_netstat_listening_ports{scope="public"}

# Same, UDP only — DNS / mDNS / WireGuard / unexpected UDP services
topsrv_netstat_listening_ports{proto="udp", scope="public"}

# Alert: a new public-bound port appeared in the last 10 minutes
# (compare current set against the set seen 10 minutes ago — non-zero rows are new exposures)
(topsrv_netstat_listening_ports{scope="public"} == 1)
unless on (instance, proto, port, family) (topsrv_netstat_listening_ports{scope="public"} offset 10m == 1)

# Count of distinct public-bound ports per host (capacity / drift watch)
count by (instance) (topsrv_netstat_listening_ports{scope="public"} == 1)

# Established TCP connections from the public internet — scan / unexpected exposure signal
topsrv_netstat_tcp_connections{state="ESTABLISHED", direction="inbound", remote_scope="public"}

# Outbound TCP to the public internet — exfil / unexpected egress watch
sum by (instance) (topsrv_netstat_tcp_connections{direction="outbound", remote_scope="public"})
```

## Process (`topsrv_process_*`)

| Metric | Type | Labels | Description |
@@ -292,6 +270,17 @@ Metrics from Angie JSON API (`/status/`). Requires `api /status/;` directive in
|--------|------|--------|-------------|
| `topsrv_angie_slab_pages` | gauge | zone, state | Slab pages (used/free) |

### ACME clients

Polled from `/status/http/acme_clients/?date=epoch` on angie 1.5+ when the `acme_client` directive is configured. If the endpoint is absent (404), the collector skips it silently, so no metrics are emitted on hosts without ACME. Per the [angie http_acme docs](https://en.angie.software/angie/docs/configuration/modules/http/http_acme/), `state` ∈ {`ready`, `requesting`, `disabled`, `failed`} and `certificate` ∈ {`valid`, `expired`, `missing`, `mismatch`, `error`}.

These metrics expose the ACME state machine. **Certificate expiry (NotAfter) for ACME-managed certs is covered by the same `topsrv_ssl_certificate_expiry_seconds` metric as static certs**: discovery checks `/var/lib/angie/acme/<client>/certificate.pem` for every `acme_client` directive and adds the file to the SSL collector. ACME-managed and pinned certs are therefore indistinguishable in that metric, except through the `path` label, which operators can filter on.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `topsrv_angie_acme_state` | gauge | name, state, certificate | `value=1` per active acme_client tuple. `name` is the directive's client name. `state` is the operator-facing state-machine value (`ready`, `requesting`, `disabled`, `failed`); `certificate` is the cert validity (`valid`, `expired`, `missing`, `mismatch`, `error`). The exact enum set may change across angie releases; label values pass through verbatim. Alert on `certificate!="valid"` or `state="failed"` |
| `topsrv_angie_acme_next_run_seconds` | gauge | name | Unix timestamp of the next scheduled action (renewal attempt / state-machine tick) for this client |
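
As a rough illustration of how one API response maps onto the two metrics above, the sketch below decodes a hypothetical `/status/http/acme_clients/?date=epoch` payload and prints the resulting series in an exposition-like form. The payload, the client names (`example`, `staging`), and the `renderACME` helper are assumptions made for this example; they are not output captured from a real angie instance or code from the collector.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// acmeClient holds the three fields read from each API entry.
type acmeClient struct {
	State       string `json:"state"`
	Certificate string `json:"certificate"`
	NextRun     int64  `json:"next_run"`
}

// renderACME is a hypothetical helper: it turns a decoded payload into
// exposition-style lines and skips next_run when it is absent (zero).
func renderACME(payload []byte) ([]string, error) {
	var clients map[string]acmeClient
	if err := json.Unmarshal(payload, &clients); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(clients))
	for name := range clients {
		names = append(names, name)
	}
	sort.Strings(names) // stable output order for the example

	var lines []string
	for _, name := range names {
		cl := clients[name]
		lines = append(lines, fmt.Sprintf(
			"topsrv_angie_acme_state{name=%q,state=%q,certificate=%q} 1",
			name, cl.State, cl.Certificate))
		if cl.NextRun > 0 {
			lines = append(lines, fmt.Sprintf(
				"topsrv_angie_acme_next_run_seconds{name=%q} %d", name, cl.NextRun))
		}
	}
	return lines, nil
}

// samplePayload is invented for illustration only.
const samplePayload = `{
  "example": {"state": "ready", "certificate": "valid", "next_run": 1767225600},
  "staging": {"state": "disabled", "certificate": "missing"}
}`

func main() {
	lines, err := renderACME([]byte(samplePayload))
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

In this sketch the `staging` client yields only a state series, because its entry carries no `next_run`.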

## Bot-logs (`topsrv_botlog_*`)

Opt-in. Emitted when `[BotLogs].Enabled = true`. The agent matches every parsed nginx access-log line against a built-in UA fingerprint table (38 families) and ships matched events as gzipped ndjson to the topsrv.io `/v1/bot-logs` endpoint, with disk-backed WAL spool for retry on transient send failures.
44 changes: 44 additions & 0 deletions docs/promql-recipes.md
@@ -0,0 +1,44 @@
# PromQL recipes

Common queries, alerts, and dashboard expressions for topsrv metrics. Indexed by collector. See [metrics.md](metrics.md) for the underlying metric definitions.

## Netstat — listening ports

```promql
# Inventory of publicly-reachable listeners across the fleet
topsrv_netstat_listening_ports{scope="public"}

# Same, UDP only — DNS / mDNS / WireGuard / unexpected UDP services
topsrv_netstat_listening_ports{proto="udp", scope="public"}

# Alert: a new public-bound port appeared in the last 10 minutes
# (compare the current set against the set seen 10 minutes ago; every series returned is a new exposure)
(topsrv_netstat_listening_ports{scope="public"} == 1)
unless on (instance, proto, port, family) (topsrv_netstat_listening_ports{scope="public"} offset 10m == 1)

# Count of distinct public-bound ports per host (capacity / drift watch)
count by (instance) (topsrv_netstat_listening_ports{scope="public"} == 1)
```
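
The new-port alert above behaves like a set difference keyed on `(instance, proto, port, family)`: series present now, minus series that were already present 10 minutes ago. A minimal Go sketch of that logic follows. The `newExposures` helper, the host name, and the label values (including `ipv4` for `family`) are invented for illustration and are not taken from topsrv.

```go
package main

import "fmt"

// listener is the label tuple the alert matches on.
type listener struct {
	Instance, Proto, Port, Family string
}

// newExposures returns the entries in now that were not in before,
// mirroring `now unless on (...) before` in PromQL.
func newExposures(now, before []listener) []listener {
	seen := make(map[listener]struct{}, len(before))
	for _, l := range before {
		seen[l] = struct{}{}
	}
	var out []listener
	for _, l := range now {
		if _, ok := seen[l]; !ok {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	// Hypothetical snapshots: 10 minutes ago and now.
	before := []listener{{"web-1", "tcp", "443", "ipv4"}}
	now := []listener{
		{"web-1", "tcp", "443", "ipv4"},
		{"web-1", "udp", "51820", "ipv4"},
	}
	for _, l := range newExposures(now, before) {
		fmt.Printf("new public listener: %+v\n", l)
	}
}
```

A port that disappears and comes back more than 10 minutes later would show up again as new, which matches the behavior of the `offset 10m` comparison.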

## Netstat — TCP connection scope

```promql
# Established TCP connections from the public internet — scan / unexpected exposure signal
topsrv_netstat_tcp_connections{state="ESTABLISHED", direction="inbound", remote_scope="public"}

# Outbound TCP to the public internet — exfil / unexpected egress watch
sum by (instance) (topsrv_netstat_tcp_connections{direction="outbound", remote_scope="public"})
```

## Angie ACME clients

```promql
# Inventory of ACME clients per host and their current state
topsrv_angie_acme_state

# Alert: any client not in cert=valid state
topsrv_angie_acme_state{certificate!="valid"}

# Time until next ACME action (seconds; negative = overdue / stuck)
topsrv_angie_acme_next_run_seconds - time()
```
19 changes: 19 additions & 0 deletions internal/app/app.go
@@ -727,6 +727,17 @@ func (a *App) registerAngie(ctx context.Context, services []topsrv.Service) {
			a.addCollector(nginx.NewSSLCollector(a.Logger, discovered.SSLCertificates))
			a.Print(ctx, "angie: monitoring SSL certificates", "count", len(discovered.SSLCertificates))
		}
		// Operator-facing diagnostic: acme_client directives configured but
		// no certs at the default state-path most likely means a custom
		// acme_client_path build option we don't auto-discover. Expiry
		// metric will be missing for those certs until pinned via static
		// ssl_certificate /full/path.
		if discovered.ACMEDirectives > discovered.ACMECertsFound {
			a.Print(ctx, "angie: ACME certs missing at default state path",
				"directives", discovered.ACMEDirectives,
				"found_on_disk", discovered.ACMECertsFound,
				"default_path", "/var/lib/angie/acme/<name>/certificate.pem")
		}

		a.Print(ctx, "angie: auto-discovered access logs", "count", len(angieCfg.AccessLogs), "config", svc.ConfigPath)
	} else {
@@ -740,6 +751,14 @@ func (a *App) registerAngie(ctx context.Context, services []topsrv.Service) {
	// Register API collector (preferred) or stub_status fallback.
	if angieCfg.StatusURL != "" {
		a.addCollector(angie.NewAPICollector(a.Logger, angieCfg.StatusURL))
		// ACME collector polls /status/http/acme_clients/?date=epoch on the
		// same angie process; it's a no-op (404) when acme_client isn't
		// configured, so it's safe to register unconditionally.
		if acme, err := angie.NewACMECollector(a.Logger, angieCfg.StatusURL); err == nil {
			a.addCollector(acme)
		} else {
			a.Error(ctx, "angie: ACME collector init failed", "error", err)
		}
	} else if angieCfg.StubStatusURL != "" {
		a.addCollector(nginx.NewStubCollector(a.Logger, angieCfg.StubStatusURL))
	}
124 changes: 124 additions & 0 deletions internal/topsrv/angie/acme.go
@@ -0,0 +1,124 @@
package angie

import (
	"context"
	"encoding/json"
	"net/http"
	"net/url"
	"strings"
	"time"

	"github.com/vmkteam/topsrv/internal/topsrv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/vmkteam/embedlog"
)

var _ topsrv.Collector = (*ACMECollector)(nil)

// ACMEClient mirrors one entry of /status/http/acme_clients/ (angie 1.5+).
// state/certificate are operator-facing enums whose set evolves across
// releases — we surface them as Prometheus labels instead of mapping to
// numbers. "details" is a free-form sentence intended for humans; not
// useful as a label (high cardinality, churns with translations) so we
// don't decode it. next_run is int64 because we ask for date=epoch.
type ACMEClient struct {
	State       string `json:"state"`
	Certificate string `json:"certificate"`
	NextRun     int64  `json:"next_run"`
}

// ACMECollector polls angie's per-client ACME status so dashboards can see
// when a cert is renewing / failed / due for next attempt without reading
// the PEM off disk (which only tells you NotAfter, not "renewal blocked").
type ACMECollector struct {
	embedlog.Logger

	url    string
	client *http.Client

	state   *prometheus.Desc
	nextRun *prometheus.Desc
}

// NewACMECollector extends APICollector's statusURL with /http/acme_clients/?date=epoch.
// date=epoch turns next_run into a plain Unix int so we skip ISO 8601 parsing.
func NewACMECollector(logger embedlog.Logger, statusURL string) (*ACMECollector, error) {
	u, err := url.Parse(statusURL)
	if err != nil {
		return nil, err
	}
	const suffix = "/http/acme_clients/"
	base := strings.TrimSuffix(u.Path, "/")
	if !strings.HasSuffix(base, strings.TrimSuffix(suffix, "/")) {
		base += suffix
	} else {
		base += "/"
	}
	u.Path = base
	u.RawQuery = "date=epoch"
	return &ACMECollector{
		Logger: logger,
		url:    u.String(),
		client: &http.Client{Timeout: 5 * time.Second},
		state: prometheus.NewDesc(
			"topsrv_angie_acme_state",
			"ACME client state (value=1 per name/state/certificate tuple).",
			[]string{"name", "state", "certificate"}, nil,
		),
		nextRun: prometheus.NewDesc(
			"topsrv_angie_acme_next_run_seconds",
			"Unix timestamp of next ACME action per client.",
			[]string{"name"}, nil,
		),
	}, nil
}

func (c *ACMECollector) Name() string { return "angie-acme" }

func (c *ACMECollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.state
	ch <- c.nextRun
}

func (c *ACMECollector) Collect(ch chan<- prometheus.Metric) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.url, nil)
	if err != nil {
		return
	}
	resp, err := c.client.Do(req)
	if err != nil {
		c.Error(ctx, "angie-acme: request failed", "error", err)
		return
	}
	defer resp.Body.Close()

	// 404 means acme_client isn't configured on this host — silently skip
	// so we don't spam logs every scrape. Other non-200 status are real
	// problems worth surfacing.
	if resp.StatusCode == http.StatusNotFound {
		return
	}
	if resp.StatusCode != http.StatusOK {
		c.Error(ctx, "angie-acme: API returned error status", "status", resp.StatusCode)
		return
	}

	var clients map[string]ACMEClient
	if err := json.NewDecoder(resp.Body).Decode(&clients); err != nil {
		c.Error(ctx, "angie-acme: failed to decode response", "error", err)
		return
	}

	for name, cl := range clients {
		ch <- prometheus.MustNewConstMetric(c.state, prometheus.GaugeValue, 1, name, cl.State, cl.Certificate)
		// next_run is omitted by angie when state ∈ {disabled, requesting} —
		// it decodes as 0; skip emitting so dashboards see "no scheduled action".
		if cl.NextRun > 0 {
			ch <- prometheus.MustNewConstMetric(c.nextRun, prometheus.GaugeValue, float64(cl.NextRun), name)
		}
	}
}
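
To make the URL handling in `NewACMECollector` concrete, the standalone sketch below repeats the same path logic in a hypothetical `acmeURL` helper and applies it to a few example status URLs. The helper name, host, and port are placeholders chosen for this illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// acmeURL is a hypothetical standalone copy of the path logic in
// NewACMECollector: append /http/acme_clients/ unless the path already
// ends with it, then force the query string to date=epoch.
func acmeURL(statusURL string) (string, error) {
	u, err := url.Parse(statusURL)
	if err != nil {
		return "", err
	}
	const suffix = "/http/acme_clients/"
	base := strings.TrimSuffix(u.Path, "/")
	if !strings.HasSuffix(base, strings.TrimSuffix(suffix, "/")) {
		base += suffix
	} else {
		base += "/"
	}
	u.Path = base
	u.RawQuery = "date=epoch" // replaces any query already present
	return u.String(), nil
}

func main() {
	inputs := []string{
		"http://127.0.0.1:8080/status/",
		"http://127.0.0.1:8080/status",
		"http://127.0.0.1:8080/status/http/acme_clients/",
	}
	for _, in := range inputs {
		out, err := acmeURL(in)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", in, out)
	}
}
```

All three inputs resolve to the same endpoint, so the constructor tolerates a status URL with or without a trailing slash, as well as one that already points at the ACME path. Note that any query string on the input is overwritten.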