feat(telemetry): opt-in anonymous ops telemetry per command #411

Open
PeterGuy326 wants to merge 13 commits into main from feat/telemetry

@PeterGuy326
Collaborator

What

Adds an opt-in, anonymous operational telemetry stream: one dimensions-only
metric per dws command invocation (error rate, latency, command distribution,
version/platform health). This is the ops-monitoring counterpart to the audit
trail, kept deliberately small and independent of it.

  • New package internal/telemetry: a 12-field Event (no content, no identity)
    and an env-driven Forwarder.
  • Hooked into executeInvocation's existing defer, reusing the
    outcome/err_class/duration already computed for command-end logging.
  • Off by default. Requires both DWS_TELEMETRY_ENABLED=true and
    DWS_TELEMETRY_URL. Optional DWS_TELEMETRY_TOKEN, DWS_TELEMETRY_TIMEOUT_MS.
  • Reference ingest under docs/telemetry/fc-sls-ingest/ (FC Web Function → SLS)
    with a dry-run mode and a zero-dependency local sink for testing without SLS.

Privacy boundary

Collects only coarse dimensions: command, subcommand, outcome, err_class, exit_code, duration_ms, cli_version, channel, os, corp_id, trace_id.
It never collects object names, free text, peer ids, device fingerprints, or request bodies.
A test asserts that command-argument content never leaks into the payload.
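
To make the boundary concrete, here is a minimal sketch of an event type carrying only the dimensions named above, plus a leak check in the spirit of the test. The struct, its Go field names and types, the `newEvent` and `payloadLeaks` helpers, the `"ok"`/`"error"` outcome values, and the sample command are all assumptions for illustration; the real `internal/telemetry` Event may differ and may carry further fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Event is a hypothetical mirror of the dimensions listed in the PR text.
// The JSON keys follow that list; everything else is an assumption.
type Event struct {
	Command    string `json:"command"`
	Subcommand string `json:"subcommand"`
	Outcome    string `json:"outcome"`
	ErrClass   string `json:"err_class"`
	ExitCode   int    `json:"exit_code"`
	DurationMS int64  `json:"duration_ms"`
	CLIVersion string `json:"cli_version"`
	Channel    string `json:"channel"`
	OS         string `json:"os"`
	CorpID     string `json:"corp_id"`
	TraceID    string `json:"trace_id"`
}

// newEvent builds an event for one invocation. args is accepted only to show
// that argument content is deliberately never copied into the event.
func newEvent(command, subcommand string, args []string, exitCode int, durMS int64) Event {
	_ = args // argument content stays out of the payload
	outcome := "ok"
	if exitCode != 0 {
		outcome = "error"
	}
	return Event{
		Command:    command,
		Subcommand: subcommand,
		Outcome:    outcome,
		ExitCode:   exitCode,
		DurationMS: durMS,
	}
}

// payloadLeaks reports whether sentinel appears anywhere in the encoded event.
func payloadLeaks(e Event, sentinel string) (bool, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return false, err
	}
	return strings.Contains(string(b), sentinel), nil
}

func main() {
	// "doc create" is a made-up command used only for this example.
	e := newEvent("doc", "create", []string{"--title", "SENTINEL-secret"}, 0, 42)
	leaks, err := payloadLeaks(e, "SENTINEL-secret")
	if err != nil {
		panic(err)
	}
	fmt.Println("leaks:", leaks)
}
```

The check works on the serialized payload rather than on individual fields, so it would also catch a leak through a field added later.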

Testing

go test ./internal/telemetry/... ./internal/app/ -run Telemetry
bash scripts/dev/telemetry_smoke.sh    # end-to-end, no SLS/cloud/login

Note for reviewers / merge order

Touches internal/app/runner.go's command-end defer — the same 2-line region as
the audit PR (#398). The two additions (emitAudit and emitTelemetry) are
independent and compose; whichever lands second resolves a trivial 2-line merge.

Emit one dimensions-only metric per dws invocation (error rate, latency,
command distribution, version/platform health) to an operator-configured
sink. Independent of the audit machinery and OFF by default.

- internal/telemetry: Event (10 coarse dimensions, no content/identity)
  + env-driven Forwarder (DWS_TELEMETRY_ENABLED/URL/TOKEN/TIMEOUT_MS)
- wire emitTelemetry into executeInvocation's defer, reusing the existing
  outcome/err_class/duration already computed for command-end logging
- docs/telemetry.md: fields, privacy boundary, SLS ingest + 4 alert rules
- tests cover enable gating, POST contract, and the privacy boundary
  (param content must never leak into the payload)

A minimal Flask web service to deploy as a Function Compute Web Function:
verifies the bearer token, then writes each telemetry Event to an SLS
Logstore via PutLogs (SLS cannot accept the raw unsigned POST directly).
Promotes the query dimensions to their own columns and keeps the full event
verbatim. Includes deploy walkthrough, local smoke test, and 4 alert rules.
…te boundary

- localsink.py: stdlib-only HTTP collector to test the full dws->HTTP
  pipeline without SLS or Function Compute
- telemetry.md: local-test walkthrough (incl. a mini local dashboard) and a
  section spelling out that the SLS project / endpoint / token live in the
  deployer's own infra and never enter this open-source repo

app.py now auto-detects mode: with no SLS_* env (or TELEMETRY_DRYRUN=true) it
logs each event to stdout and returns 204 instead of writing to SLS, and the
aliyun-log SDK is imported lazily so dry-run needs no extra dependency. Lets you
deploy to Function Compute and confirm the client->FC pipeline end-to-end before
provisioning any SLS resource. GET / reports the active mode. README documents
the deploy-then-wire-SLS flow.

scripts/dev/telemetry_smoke.sh builds dws, starts the zero-dep local sink,
fires --mock commands and asserts the pipeline: events received with all
expected dimensions, bearer token enforced (401), and the privacy boundary
(a sentinel command argument must never appear in any payload). Exits non-zero
on failure, so it can gate pre-push / CI.

Open-source repo convention: configmeta descriptions, docs/telemetry.md and the
FC ingest README are now English (code comments were already English). No
behavior change; tests still pass.

Convert all Chinese content in the telemetry surface to English so the public
repo contains no localized text:
- docs/telemetry.md (full doc)
- docs/telemetry/fc-sls-ingest/README.md (FC->SLS receiver guide)
- internal/telemetry/telemetry.go (config item descriptions)
- internal/app/telemetry_runtime_test.go (test fixture string)

No behavior change; English-only wording.
…pt-out + disclosure

Lets a downstream "fleet" distribution ship telemetry on-by-default to its own
ingest, while the open-source build stays opt-in and off — and hardcodes no
endpoint.

- internal/telemetry/telemetry.go:
  - build-time vars defaultURL/defaultToken (empty in OSS; injected via -ldflags
    by a downstream build)
  - Enabled() posture: DWS_TELEMETRY_DISABLED hard opt-out wins; explicit
    DWS_TELEMETRY_ENABLED overrides; otherwise on only when a default endpoint is
    baked in. Env URL/token override the build defaults.
  - ShowNoticeOnce(): one-time stderr disclosure (marker ~/.dws/.telemetry_notice_shown)
  - new DWS_TELEMETRY_DISABLED env + configmeta registration
- internal/app/telemetry_runtime.go: print the disclosure once when telemetry first activates
- internal/telemetry/telemetry_test.go: cover baked-in default-on + opt-out (OSS opt-in cases unchanged)
- docs/telemetry.md: document default posture, ldflags injection, opt-out, disclosure
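
The precedence described for Enabled() can be sketched as a small pure function. This is one reading of the commit message, not the actual implementation: the `enabled` and `truthy` helpers are hypothetical, parsing with `strconv.ParseBool` is an assumption, and the file sink added in a later commit is left out.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultURL stands in for the build-time variable that a downstream build
// would inject via -ldflags; it is empty in the open-source build.
var defaultURL = ""

// truthy treats any value strconv.ParseBool accepts as true ("1", "true", ...).
// How the real code parses these variables is an assumption.
func truthy(v string) bool {
	b, err := strconv.ParseBool(v)
	return err == nil && b
}

// enabled resolves the telemetry posture from an environment lookup function.
func enabled(getenv func(string) string) bool {
	// 1. The hard opt-out always wins.
	if truthy(getenv("DWS_TELEMETRY_DISABLED")) {
		return false
	}
	// 2. Without a destination there is nothing to enable.
	url := getenv("DWS_TELEMETRY_URL")
	if url == "" {
		url = defaultURL // env overrides the build default
	}
	if url == "" {
		return false
	}
	// 3. An explicit DWS_TELEMETRY_ENABLED overrides the default posture.
	if v := getenv("DWS_TELEMETRY_ENABLED"); v != "" {
		return truthy(v)
	}
	// 4. Otherwise on only when an endpoint is baked in.
	return defaultURL != ""
}

func main() {
	fmt.Println("telemetry enabled:", enabled(os.Getenv))
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps each posture case testable without mutating the process environment.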

Verified e2e: a build with a baked-in endpoint and no env vars set defaults to on,
prints the notice once, and reports; DWS_TELEMETRY_DISABLED=true suppresses it.
…erless Devs)

Make the FC->SLS reference receiver deployable without manual steps:
- Dockerfile: container image (AONE / any container platform), gunicorn on :9000,
  app.py auto-detects dry-run vs SLS mode from env. Built, ran, and received a live
  event locally.
- s.yaml + deploy.sh: Serverless Devs spec for public Aliyun FC (s build && s deploy).
- .dockerignore: keep the image to app.py + requirements.txt.

No behavior change to the receiver; packaging only.
…ver-less monitoring

Add a zero-infra sink: when DWS_TELEMETRY_FILE is set, each event is appended as
one JSON line to that local file instead of being POSTed — no receiver, no FC, no
SLS. Ideal for local/per-machine stability monitoring; aggregate the file with a
small script (see docs/telemetry.md). File sink takes precedence over URL and,
when set, enables telemetry (with the same DWS_TELEMETRY_DISABLED opt-out).

- telemetry.go: EnvFile + resolvedFile (with ~ expansion); Enabled() counts a file
  sink as a destination; Forwarder.File appends JSONL in Emit.
- test: file sink enables + appends valid JSON lines + opt-out still wins.
- docs: "Local monitoring (lightest)" section + one-line aggregation.
…PPEND)

Make the zero-dep local sink usable as a tiny central collector on your own
machine — e.g. "monitor on my computer" for a small team, no SLS/FC needed:
- HOST env (default 127.0.0.1; set 0.0.0.0 to accept POSTs from LAN machines)
- APPEND env (default truncate for tests; APPEND=1 keeps history across restarts)
- startup banner shows the real bind host + append mode

Token auth is strongly advised when binding 0.0.0.0. Verified: a dws pointed at
the machine's LAN IP lands events in the collector file.
…deploy artifacts + gofmt), take their smoke script