feat(telemetry): opt-in anonymous ops telemetry (per-command), with FC→SLS reference receiver #410

Closed
PeterGuy326 wants to merge 13 commits into DingTalk-Real-AI:main from PeterGuy326:feat/telemetry

@PeterGuy326
Collaborator

Summary

Adds opt-in, anonymous, dimensions-only ops telemetry: dws can emit one JSON
metric per command invocation to a deployer-configured endpoint, for monitoring
error rate, latency, command distribution, and version/platform health.

Off by default: with DWS_TELEMETRY_ENABLED unset, nothing is emitted (zero
hot-path impact). Centralized reporting is opt-in and explicitly disclosed.

What's included

  • internal/telemetry/ — event schema + emitter (timeout-bounded; never gates the command)
  • internal/app/telemetry_runtime.go — runtime wiring into command execution
  • docs/telemetry.md — full operator doc (enabling, fields, receiver contract, local testing, SLS wiring, alerts)
  • docs/telemetry/fc-sls-ingest/ — reference FC→SLS receiver (app.py Flask, localsink.py zero-dep local sink) with a dry-run mode to validate the pipeline before touching SLS

Privacy boundary

  • Collects coarse dimensions only: command/subcommand/outcome/exit_code/duration_ms/cli_version/channel/os/corp_id/trace_id.
  • Never collects user identity, object names/ids, free text, device fingerprints, or request/response bodies.
  • Endpoint URL/token are read from env vars at runtime; no vendor address is ever hardcoded in the repo.
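The privacy boundary above lends itself to a mechanical check. The following is a minimal sketch, not the PR's own test code: it assumes the ten dimensions appear as top-level JSON keys with exactly the names listed above (the real wire names may differ), and the helper names `unexpected_fields` and `leaks_sentinel` are hypothetical.

```python
import json

# The ten dimensions named in the privacy boundary; assumed to be the
# top-level JSON keys of an event (an assumption, not confirmed by the PR).
ALLOWED_DIMENSIONS = frozenset({
    "command", "subcommand", "outcome", "exit_code", "duration_ms",
    "cli_version", "channel", "os", "corp_id", "trace_id",
})


def unexpected_fields(payload: str) -> set:
    """Return top-level keys of a JSON event that fall outside the allowlist."""
    event = json.loads(payload)
    if not isinstance(event, dict):
        raise ValueError("telemetry event must be a JSON object")
    return set(event) - ALLOWED_DIMENSIONS


def leaks_sentinel(payload: str, sentinel: str) -> bool:
    """True if a marker string passed as a command argument appears in the payload."""
    return sentinel in payload
```

An allowlist fails closed: a newly added field is flagged until someone deliberately extends the set, which is usually the safer default for a privacy check.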

Config (all env, all default off)

| Var | Purpose |
| --- | --- |
| DWS_TELEMETRY_ENABLED | enable (requires URL too) |
| DWS_TELEMETRY_URL | ingest endpoint |
| DWS_TELEMETRY_TOKEN | optional bearer |
| DWS_TELEMETRY_TIMEOUT_MS | per-report timeout (default 1500) |

Notes for reviewers

  • All docs/strings are English-only.
  • Suggest squash-merge to keep main history clean (the branch's intermediate commits predate the English pass).

Emit one dimensions-only metric per dws invocation (error rate, latency,
command distribution, version/platform health) to an operator-configured
sink. Independent of the audit machinery and OFF by default.

- internal/telemetry: Event (10 coarse dimensions, no content/identity)
  + env-driven Forwarder (DWS_TELEMETRY_ENABLED/URL/TOKEN/TIMEOUT_MS)
- wire emitTelemetry into executeInvocation's defer, reusing the existing
  outcome/err_class/duration already computed for command-end logging
- docs/telemetry.md: fields, privacy boundary, SLS ingest + 4 alert rules
- tests cover enable gating, POST contract, and the privacy boundary
  (param content must never leak into the payload)

A minimal Flask web service to deploy as a Function Compute Web Function:
verifies the bearer token, then writes each telemetry Event to an SLS
Logstore via PutLogs (SLS cannot accept the raw, unsigned POST directly).
Promotes the query dimensions to their own columns and keeps the full event
verbatim. Includes deploy walkthrough, local smoke test, and 4 alert rules.
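The two receiver steps described above (token check, then column promotion) can be sketched as plain functions. This is an illustration under assumptions, not the code in app.py: the promoted column set is assumed to be the ten dimensions from the PR description, the `raw` column name and both function names are hypothetical, and the actual PutLogs call is left out.

```python
import hmac
import json

# Assumed column names: the ten dimensions listed in the PR description.
PROMOTED = ("command", "subcommand", "outcome", "exit_code", "duration_ms",
            "cli_version", "channel", "os", "corp_id", "trace_id")


def bearer_ok(authorization_header, expected_token):
    """Constant-time check of an 'Authorization: Bearer <token>' header."""
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    presented = authorization_header[len(prefix):]
    return hmac.compare_digest(presented.encode(), expected_token.encode())


def to_log_contents(event):
    """Flatten an event into (key, value) string pairs.

    Promoted dimensions become their own columns; the full event is kept
    verbatim as JSON under the hypothetical "raw" key.
    """
    contents = [(key, str(event[key])) for key in PROMOTED if key in event]
    contents.append(("raw", json.dumps(event, sort_keys=True)))
    return contents
```

Keeping the raw JSON next to the promoted columns means a dimension added later can still be queried from old rows without re-ingesting.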
…te boundary

- localsink.py: stdlib-only HTTP collector to test the full dws->HTTP
  pipeline without SLS or Function Compute
- telemetry.md: local-test walkthrough (incl. a mini local dashboard) and a
  section spelling out that the SLS project / endpoint / token live in the
  deployer's own infra and never enter this open-source repo

app.py now auto-detects mode: with no SLS_* env (or TELEMETRY_DRYRUN=true) it
logs each event to stdout and returns 204 instead of writing to SLS, and the
aliyun-log SDK is imported lazily so dry-run needs no extra dependency. Lets you
deploy to Function Compute and confirm the client->FC pipeline end-to-end before
provisioning any SLS resource. GET / reports the active mode. README documents
the deploy-then-wire-SLS flow.
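The mode auto-detection described above might look roughly like this. It is a sketch under assumptions: the commit message only says "SLS_* env", so the code treats any non-empty variable with that prefix as SLS configuration, and the function name `detect_mode` and the mode labels are hypothetical.

```python
import os


def detect_mode(env=None):
    """Return "dry-run" or "sls" based on the environment.

    Dry-run wins when TELEMETRY_DRYRUN=true or when no SLS_* variable is set.
    In SLS mode the real receiver would import the aliyun-log SDK lazily,
    so a dry-run deployment does not need that dependency installed.
    """
    env = os.environ if env is None else env
    if env.get("TELEMETRY_DRYRUN", "").strip().lower() == "true":
        return "dry-run"
    has_sls = any(key.startswith("SLS_") and value for key, value in env.items())
    return "sls" if has_sls else "dry-run"
```

Passing the environment in as a mapping keeps the decision testable without mutating `os.environ`.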
scripts/dev/telemetry_smoke.sh builds dws, starts the zero-dep local sink,
fires --mock commands and asserts the pipeline: events received with all
expected dimensions, bearer token enforced (401), and the privacy boundary
(a sentinel command argument must never appear in any payload). Exits non-zero
on failure, so it can gate pre-push / CI.

Open-source repo convention: configmeta descriptions, docs/telemetry.md and the
FC ingest README are now English (code comments were already English). No
behavior change; tests still pass.

Convert all Chinese content in the telemetry surface to English so the public
repo leaves no localized traces:
- docs/telemetry.md (full doc)
- docs/telemetry/fc-sls-ingest/README.md (FC->SLS receiver guide)
- internal/telemetry/telemetry.go (config item descriptions)
- internal/app/telemetry_runtime_test.go (test fixture string)

No behavior change; English-only wording.
…pt-out + disclosure

Lets a downstream "fleet" distribution ship telemetry on-by-default to its own
ingest, while the open-source build stays opt-in and off — and hardcodes no
endpoint.

- internal/telemetry/telemetry.go:
  - build-time vars defaultURL/defaultToken (empty in OSS; injected via -ldflags
    by a downstream build)
  - Enabled() posture: DWS_TELEMETRY_DISABLED hard opt-out wins; explicit
    DWS_TELEMETRY_ENABLED overrides; otherwise on only when a default endpoint is
    baked in. Env URL/token override the build defaults.
  - ShowNoticeOnce(): one-time stderr disclosure (marker ~/.dws/.telemetry_notice_shown)
  - new DWS_TELEMETRY_DISABLED env + configmeta registration
- internal/app/telemetry_runtime.go: print the disclosure once when telemetry first activates
- internal/telemetry/telemetry_test.go: cover baked-in default-on + opt-out (OSS opt-in cases unchanged)
- docs/telemetry.md: document default posture, ldflags injection, opt-out, disclosure

Verified end to end: a build with a baked-in endpoint and no env vars set
defaults to on, prints the notice once, and reports; DWS_TELEMETRY_DISABLED=true
suppresses it.
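The precedence described for Enabled() can be modeled as a small decision function. This is a Python model of the described behavior, not a translation of the Go code: the accepted truthy spellings are an assumption, `default_url` stands in for the value injected via -ldflags, and the file sink introduced in a later commit is not modeled.

```python
def _truthy(value):
    # Assumed parsing; the Go implementation may accept different spellings.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def telemetry_enabled(env, default_url=""):
    """Model of the described posture for a given environment mapping."""
    # 1. The hard opt-out always wins.
    if _truthy(env.get("DWS_TELEMETRY_DISABLED", "")):
        return False
    # Env URL overrides the build-time default endpoint.
    url = env.get("DWS_TELEMETRY_URL") or default_url
    # 2. An explicit DWS_TELEMETRY_ENABLED decides, but still needs an endpoint.
    if "DWS_TELEMETRY_ENABLED" in env:
        return _truthy(env["DWS_TELEMETRY_ENABLED"]) and bool(url)
    # 3. Otherwise on only when a default endpoint is baked into the build.
    return bool(default_url)
```

With an empty `default_url` (the open-source build) and no environment variables, the function returns `False`, which matches the opt-in posture described above.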
…erless Devs)

Make the FC->SLS reference receiver deployable without hand-steps:
- Dockerfile: container image (AONE / any container platform), gunicorn on :9000;
  app.py auto-detects dry-run vs SLS mode from env. Built and run locally, and it
  received a live event.
- s.yaml + deploy.sh: Serverless Devs spec for public Aliyun FC (s build && s deploy).
- .dockerignore: keep the image to app.py + requirements.txt.

No behavior change to the receiver; packaging only.
…ver-less monitoring

Add a zero-infra sink: when DWS_TELEMETRY_FILE is set, each event is appended as
one JSON line to that local file instead of being POSTed — no receiver, no FC, no
SLS. Ideal for local/per-machine stability monitoring; aggregate the file with a
small script (see docs/telemetry.md). File sink takes precedence over URL and,
when set, enables telemetry (with the same DWS_TELEMETRY_DISABLED opt-out).

- telemetry.go: EnvFile + resolvedFile (with ~ expansion); Enabled() counts a file
  sink as a destination; Forwarder.File appends JSONL in Emit.
- test: file sink enables + appends valid JSON lines + opt-out still wins.
- docs: "Local monitoring (lightest)" section + one-line aggregation.
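Aggregating the JSONL file could be as small as the sketch below. It is not the one-liner from docs/telemetry.md; it assumes each line is a JSON object with `command`, `exit_code`, and `duration_ms` keys as named in the PR description, and it treats a non-zero `exit_code` as a failure. The function name `summarize` is hypothetical.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Per-command count, failure rate (exit_code != 0), and mean duration_ms."""
    stats = defaultdict(lambda: {"count": 0, "failures": 0, "total_ms": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a partially written or corrupt line
        if not isinstance(event, dict):
            continue
        entry = stats[event.get("command", "(unknown)")]
        entry["count"] += 1
        if event.get("exit_code", 0) != 0:
            entry["failures"] += 1
        entry["total_ms"] += event.get("duration_ms", 0)
    return {
        command: {
            "count": entry["count"],
            "failure_rate": entry["failures"] / entry["count"],
            "mean_ms": entry["total_ms"] / entry["count"],
        }
        for command, entry in stats.items()
    }
```

To run it against a real file, open whatever path DWS_TELEMETRY_FILE points to and pass the file object: `with open(path) as f: print(summarize(f))`.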
…PPEND)

Make the zero-dep local sink usable as a tiny central collector on your own
machine — e.g. "monitor on my computer" for a small team, no SLS/FC needed:
- HOST env (default 127.0.0.1; set 0.0.0.0 to accept POSTs from LAN machines)
- APPEND env (default truncate for tests; APPEND=1 keeps history across restarts)
- startup banner shows the real bind host + append mode

Token auth is strongly advised when binding 0.0.0.0. Verified: a dws pointed at
the machine's LAN IP lands events in the collector file.
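For readers who have not seen localsink.py, a stdlib-only collector with the HOST and APPEND behavior described above could be sketched as follows. This is an illustration, not the PR's file: the function name `make_server`, its parameters, and the 204/400 status choices are assumptions; only the HOST and APPEND defaults and the 401 on a bad token come from the PR text.

```python
import hmac
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_server(out_path, token="", host=None, port=0, append=None):
    """Build a collector that appends each POSTed JSON event to out_path."""
    host = host or os.environ.get("HOST", "127.0.0.1")  # 0.0.0.0 accepts LAN POSTs
    if append is None:
        append = os.environ.get("APPEND") == "1"
    if not append:
        open(out_path, "w").close()  # truncate on startup unless history is kept

    class Handler(BaseHTTPRequestHandler):
        def _reply(self, status):
            self.send_response(status)
            self.end_headers()

        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            presented = self.headers.get("Authorization", "")
            expected = f"Bearer {token}"
            if token and not hmac.compare_digest(presented.encode(), expected.encode()):
                return self._reply(401)
            try:
                event = json.loads(body)
            except ValueError:
                return self._reply(400)
            with open(out_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(event) + "\n")  # one JSON object per line
            self._reply(204)

        def log_message(self, fmt, *args):
            pass  # keep the example quiet

    return HTTPServer((host, port), Handler)
```

Calling `make_server("events.jsonl", token="...").serve_forever()` would run it in the foreground. Passing `port=0` lets the OS pick a free port, which is convenient in tests; a real deployment would pin the port.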
…deploy artifacts + gofmt), take their smoke script
@PeterGuy326
Collaborator Author

Consolidated into #411 (origin/feat/telemetry). #411 now has the full superset: English + gofmt + default-on (ldflags) + local file sink + deploy artifacts + the smoke script. Closing this fork-based duplicate.

@PeterGuy326 closed this on Jun 4, 2026