Skip to content

feat(observability): OTel traces/metrics + structured log-field canon (Coralogix)#224

Open
galt-tr wants to merge 9 commits into
mainfrom
feat/otel-coralogix
Open

feat(observability): OTel traces/metrics + structured log-field canon (Coralogix)#224
galt-tr wants to merge 9 commits into
mainfrom
feat/otel-coralogix

Conversation

@galt-tr

@galt-tr galt-tr commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Goal

Ship Coralogix observability for arcade so a txid or block_hash is
searchable end-to-end across the transaction lifecycle. Structured JSON
logging already existed; this canon-normalizes the log fields, adds
trace/span correlation, and layers OTel traces + metrics on top — all off
by default. Addresses #223. Implements the design in
docs/plans/otel-integration.md.

What changed

  • telemetry/ package + Prometheus→OTLP bridge: builds the shared OTEL
    resource (service.name, service.namespace, service.version,
    service.instance.id, arcade.mode), wires OTLP trace/metric exporters
    (gRPC or HTTP/protobuf), and bridges the existing arcade_* Prometheus
    registry onto the OTLP metric pipeline via
    go.opentelemetry.io/contrib/bridges/prometheus — no metric name/label
    changes, /metrics keeps scraping unchanged, both exporters run
    simultaneously (dual-export, expected not a bug).
  • HTTP tracing: otelgin on the inbound gin engine (outside panic
    recovery, so a panic still lands on a live span); otelhttp transports on
    every outbound http.Client (Teranode broadcast, merkle-service client,
    DataHub client, webhook delivery).
  • Kafka trace propagation: inject on produce / extract on consume via a
    small propagation.TextMapCarrier adapter over the existing header map —
    preserves the F-024 ("register with merkle-service before broadcast")
    durability invariant, and is a genuine zero-allocation no-op when there is
    no active span (telemetry disabled), verified by benchmark.
  • logfields canon package: typed constructors for every
    transaction/block-lifecycle field (txid, txids, txid_count,
    block_hash, block_height, subtree_hash, callback_url, status,
    stage) — the same field names merkle-service uses, so one Coralogix
    query spans both services. A static AST test
    (logfields/enforce_test.go) fails the build if any of these field names
    is ever constructed via a raw zap.String/Strings/Int/Uint64/Any/Reflect
    call outside the package, so the canon can't silently regress.
  • Full transaction-lifecycle logging: every status transition is now
    logged — bounded (TxIDBatch, single line) for the sync pre-response
    RECEIVED lines, full-coverage chunked (ForEachTxIDChunk, guarantees every
    txid appears somewhere in the stream) for the async ACCEPTED_BY_NETWORK /
    SEEN_ON_NETWORK[_MULTIPLE_NODES] / MINED lines, plus REJECTED (intake /
    network / cascade, tagged with stage), a Debug-level merkle-registration
    checkpoint, STUMP/BLOCK_PROCESSED callback lines, and webhook-delivered.
    trace_id/span_id ride along wherever the log call has a live sampled
    span in context (telemetry.LoggerWith) — including MINED, whose Kafka
    consumer context is re-hydrated from the producer's injected trace
    headers.
  • stdlib-log capture: zap.RedirectStdLog so libraries logging through
    the standard library's package-global logger (e.g. gocore) also land as
    structured JSON instead of leaking unstructured text to stderr.
  • Two carried-over fixes (caught in merkle-service's review, applied
    here before they shipped): config.validate() now rejects an
    http(s)://-scheme telemetry.endpoint for both protocols, not just
    grpc; WithInsecure() is hoisted so telemetry.insecure=true applies
    even when the endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT rather
    than telemetry.endpoint.
  • deploy/*.yaml: every workload manifest (all, api-server,
    bump-builder, chaintracks, p2p-client, propagation, sse, watchdog) gets
    HOST_IP/POD_NAME downward-API fields plus
    OTEL_EXPORTER_OTLP_ENDPOINT/ARCADE_TELEMETRY_{ENABLED,INSECURE} env,
    pointed at a node-local collector. These repo manifests are
    reference/dev deployments
    — the authoritative production deploy and
    OTLP wiring is bsva-infra-flux#227.
  • docs/observability.md (new): the OTLP pipeline, full
    telemetry.*/OTEL_* config-env reference, resource attributes, the
    field canon and lifecycle-log inventory (including which two statuses —
    IMMUTABLE and store-backed PENDING_RETRY — have no active producer
    today and so are never logged as transitions), Coralogix search recipes,
    and trace/log correlation. Cross-linked from README.md and
    metrics/README.md.

Off by default

ARCADE_TELEMETRY_ENABLED (default false) gates the entire OTel pipeline.
With it unset/false, telemetry.Init builds no providers, sets no OTEL
globals, and opens no network connections — runtime behaviour is identical
to a build with no OTel support at all. The structured-logging /
logfields canon changes are unconditional (always-on, same as existing
JSON logging) but are pure logging — no behavior change to any request
path. Kafka trace-header injection is a verified zero-allocation no-op when
disabled.

Companion PRs

  • merkle-service: #170 — the mirror-image OTel/logging work on that service, sharing the same field canon.
  • flux (production deploy wiring + scoped JSON log parsing): bsva-infra-flux#227.

Testing

  • go build ./..., go test ./... -count=1 — all green.
  • go test -race on telemetry, kafka, services/propagation,
    services/api_server — all green.
  • gofmt -l . — clean. golangci-lint run ./... — 0 issues.
  • go mod tidy — no diff; all OTel core modules pinned to v1.44.0,
    contrib to v0.69.0.
  • New regression tests: config test enumerating both protocols × both
    schemes for the endpoint-scheme rejection; a telemetry test that forces a
    scheme-less env-supplied endpoint (so the OTLP library's own
    scheme-based auto-insecure detection can't mask the bug) to prove
    insecure=true is honored on that path; existing lifecycle-logging tests
    (bounded vs. chunked coverage, Debug-level registration, chunk
    numbering) from the prior commits in this branch.
  • OTLP logs stay off: telemetry.logs=true only emits a one-time startup
    warning; logs continue to ship via stdout JSON → filelog receiver, never
    over OTLP.

https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu

galt-tr added 9 commits July 2, 2026 11:40
Implements Phase 0+1 of docs/plans/otel-integration.md: a new telemetry
package owns OTEL trace/metric provider lifecycle, config.TelemetryConfig
wires an opt-in telemetry block (ARCADE_TELEMETRY_* / OTEL_EXPORTER_OTLP_*
env), and the Prometheus contrib bridge feeds every existing arcade_*
promauto metric onto the OTLP pipeline with zero call-site changes. With
telemetry.enabled=false (the default), no providers are built and no
network connections are opened, so runtime behaviour is unchanged.

Also redirects stdlib log.* output through zap in cmd/arcade/main.go and
logs app.Bootstrap failures before returning them.

Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
Wire otelgin into the api-server engine for inbound server spans
(filtering /health, /ready, /metrics) and append telemetry.Fields to the
request logger and panic recovery handler so log lines correlate with
traces. Wrap the teranode, merkle-service, webhook callback, and DataHub
HTTP clients' transports with otelhttp so outbound calls get client spans
and W3C traceparent propagation, preserving each client's existing
transport behaviour (broadcast transport, SSRF guard, default transport).

Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
Add kafka/otel.go (InjectTraceContext/ExtractTraceContext + headerCarrier)
and wire it through the produce and consume paths: Producer.Send/SendAsync/
SendBatch/SendRaw now take ctx as their first parameter and forward it to
the broker; saramaBroker injects W3C trace headers and wraps each produce in
a "kafka.produce <topic>" span; memoryBroker.publish does the same so
standalone mode round-trips traces too. The generic consume path
(ConsumerGroup.processOne) extracts the trace context and wraps handler +
DLQ routing in a "kafka.consume <topic>" span. The propagation dispatcher
stashes each message's extracted producer SpanContext on propagationMsg
(unexported, not serialized) and processBatch opens one
"propagation.broadcast" span per flushed batch, linking up to 128 producer
spans (OTel batch-messaging pattern) rather than one span per message.

Every guard mirrors the same zero-cost-when-disabled shape used by the
existing HTTP tracing: InjectTraceContext returns nil (no allocation) when
ctx carries no valid span, and span creation is skipped entirely on that
same check rather than relying only on the no-op tracer, cutting the
disabled produce/consume path down to the pre-existing message-construction
allocations (bench-verified, see otel_test.go/bench_test.go).

Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
Add the logfields package with canonical zap.Field constructors (TxID,
TxIDs, TxIDCount, BlockHash, BlockHeight, SubtreeHash, CallbackURL,
Status, Stage) plus TxIDBatch and ForEachTxIDChunk helpers for logging
txid batches without blowing out a single log line. Migrate all ~47
existing identifier log sites across api_server, propagation,
bump_builder, webhook, sse, watchdog, chaintracks_server, merkleservice,
and events to the new constructors, fixing the one camelCase outlier
(blockHash) and retiring the ad-hoc txid_total/txid_count/txids_sample
variants in favor of txid_count/txids. This gives merkle-service and
arcade a shared field vocabulary so a txid or block_hash search in
Coralogix can correlate log lines across both services.

enforce_test.go statically walks the repo via go/ast and fails the
build if any zap.String/Strings/Int/Uint64 call outside this package
uses one of the canonical (or retired) field-name literals directly,
guarding against regression back to ad-hoc field names.

Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…archability

Close the transaction-lifecycle log gaps so a txid or block_hash search
in Coralogix returns the full lifecycle of a transaction, not just the
REJECTED terminal states that were already logged. Add Info-level
lifecycle lines (via the logfields constructors) at every status
transition that was previously silent:

- RECEIVED (single and batch submit) in handleSubmitTransaction /
  handleSubmitTransactions
- ACCEPTED_BY_NETWORK in applyTerminalStatuses
- SEEN_ON_NETWORK / SEEN_MULTIPLE_NODES in applySeenCallback
- MINED in markMinedAndPublish, chunked via logfields.ForEachTxIDChunk
  so every txid in a block surfaces somewhere in the log stream
  regardless of block size
- PENDING_RETRY in requeueAfterDelay (once per call, not per message)
- merkle-service registration in registerBatch, at Debug (not a status
  transition, keeps Info volume flat)
- STUMP stored / BLOCK_PROCESSED enqueued in handleStump /
  handleBlockProcessed
- webhook delivery success raised from Debug to Info (bounded by
  subscription count, not raw TPS)

HTTP and Kafka-message handler entry points derive their logger via
telemetry.LoggerWith(ctx, ...) so these lines carry trace_id/span_id
when tracing is enabled. Batch-level operations that fan across many
unrelated producer traces (propagator batch processing) intentionally
keep the plain component logger — attaching one arbitrary trace_id to
a multi-trace batch would be misleading.

Every batch/chunk line uses TxIDBatch or ForEachTxIDChunk from the
logfields package, so no line is unbounded and MINED lines cover every
txid without capping.

Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…fecycle coverage

Quality-review fixes for Task 4 (both commits above):

logfields / enforcement guard:
- Add "status" to enforce_test bannedKeys so a raw zap.String("status",...)
  outside logfields is now caught (Status() is canonical, 7 sites).
- Extend the AST guard to also flag zap.Any/zap.Reflect (untyped escape
  hatches) whose first arg is a banned key; document the exact constructor
  coverage on zapFieldFuncs. Make the status/Any/Reflect probes permanent,
  table-driven subtests in TestScanDetectsViolation.
- Rename two pre-existing HTTP-status-code sites from "status" to the
  established "status_code" convention (api_server request logger,
  bump/datahub fetch) — resolves the int-vs-string type collision the new
  ban surfaced. Update the datahub test accordingly.
- Add logfields.Stage{Intake,Network,Cascade} constants; use them at all
  Stage() call sites to prevent typos.

lifecycle coverage vs hot path:
- Batch RECEIVED (sync, pre-response, /txs up to 256MiB): switch from
  unbounded ForEachTxIDChunk to a BOUNDED TxIDBatch line (true count +
  capped list). Full per-txid coverage still arrives on the async ACCEPTED
  line. Comment explains why.
- ACCEPTED_BY_NETWORK and SEEN (both async, off the request path): switch
  from bounded TxIDBatch to full-coverage ForEachTxIDChunk, matching MINED.
- "processing batch": drop the 5-item preview that was shipping under the
  canonical "txids" key (colliding with full-list semantics); txid_count
  conveys size and the full list lives on the ACCEPTED line.
- Add comments at the webhook + propagator lifecycle lines that
  intentionally skip telemetry.LoggerWith (no request ctx / batch spans
  many unrelated producer traces).
- Strengthen tests: RECEIVED-batch proves bounded (1001 txs -> 1 line);
  SEEN proves full chunked coverage (2300 txs -> 3 lines, union covers all);
  ACCEPTED pins the chunked path via chunk_total.

Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…insecure on env endpoint

Two carried-over fixes from the sibling merkle-service review, applied here
before they hit production:

1. config.validate() only rejected an http(s):// scheme in telemetry.endpoint
   when protocol=grpc. The otlptracehttp/otlpmetrichttp exporters' WithEndpoint
   also want bare host:port, so an "http://collector:4318" under protocol=http
   passed validation and then failed silently in the exporter's background
   goroutine. Now rejected for both protocols, since the endpoint field is
   always host:port regardless of transport.

2. WithInsecure() was nested inside the `cfg.Endpoint != ""` branch in both
   the trace and metric exporter builders, so telemetry.insecure=true was
   silently dropped whenever the endpoint came from OTEL_EXPORTER_OTLP_ENDPOINT
   instead of cfg.Endpoint. Hoisted so it applies whenever cfg.Insecure is
   true, independent of which endpoint path is in play.

Also guard the global propagator install behind at least one signal pipeline
being built (enabled=true with traces=false and metrics=false now leaves the
propagator untouched), matching the same review fix in merkle-service.

Added a config test enumerating both protocols x both schemes for the
endpoint-scheme rejection, and a telemetry test that forces a scheme-less
env-supplied endpoint (so the OTLP library's own scheme-based auto-insecure
detection can't mask the bug) to prove insecure=true is honored on that path.

Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…ult)

Add HOST_IP/POD_NAME downward-API fields plus OTEL_EXPORTER_OTLP_ENDPOINT
and ARCADE_TELEMETRY_{ENABLED,INSECURE} to every mode's container env in
deploy/*.yaml (all, api-server, bump-builder, chaintracks, p2p-client,
propagation, sse, watchdog), pointing at a node-local collector reached
via the pod's host IP. HOST_IP is defined inline (not from a ConfigMap) and
precedes its $(HOST_IP) use, per the downward-API substitution requirement.

Telemetry stays off (ARCADE_TELEMETRY_ENABLED=false) so this is purely
additive plumbing — no behavior change until an operator flips the flag.

These repo manifests are reference/dev deployments; the authoritative
production deploy and OTLP wiring lives in the flux repo
(bsva-infra-flux#227).

Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…fecycle logs

New docs/observability.md covers what actually shipped in the OTel/telemetry
work: the OTLP pipeline (node-local collector, traces+metrics pushed, logs
staying on stdout->filelog rather than OTLP), the full telemetry.*/OTEL_*
config-env reference table, resource attributes (including arcade.mode), the
snake_case log-field canon shared with merkle-service (txid/txids/txid_count/
block_hash/block_height/subtree_hash/callback_url/status/stage/trace_id/
span_id) and its static enforcement, the transaction-lifecycle log line
inventory per status (bounded RECEIVED vs full-coverage chunked ACCEPTED/
SEEN/MINED, the Debug-level merkle-registration checkpoint, STUMP/
BLOCK_PROCESSED, webhook delivery), a note that IMMUTABLE and store-backed
PENDING_RETRY have no active producer today, Coralogix search recipes, the
dual Prometheus+OTLP metrics export, and trace/log correlation via trace_id.

Cross-linked from README.md and metrics/README.md.

Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
@galt-tr galt-tr requested a review from mrz1836 as a code owner July 2, 2026 18:01
@github-actions github-actions Bot added the feature Any new significant addition label Jul 2, 2026
@github-actions github-actions Bot added the size/XL Very large change (>500 lines) label Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Any new significant addition size/XL Very large change (>500 lines)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants