feat(observability): OTel traces/metrics + structured log-field canon (Coralogix)#224
Open
galt-tr wants to merge 9 commits into
Open
feat(observability): OTel traces/metrics + structured log-field canon (Coralogix)#224galt-tr wants to merge 9 commits into
galt-tr wants to merge 9 commits into
Conversation
Implements Phase 0+1 of docs/plans/otel-integration.md: a new telemetry package owns OTEL trace/metric provider lifecycle, config.TelemetryConfig wires an opt-in telemetry block (ARCADE_TELEMETRY_* / OTEL_EXPORTER_OTLP_* env), and the Prometheus contrib bridge feeds every existing arcade_* promauto metric onto the OTLP pipeline with zero call-site changes. With telemetry.enabled=false (the default), no providers are built and no network connections are opened, so runtime behaviour is unchanged. Also redirects stdlib log.* output through zap in cmd/arcade/main.go and logs app.Bootstrap failures before returning them. Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
Wire otelgin into the api-server engine for inbound server spans (filtering /health, /ready, /metrics) and append telemetry.Fields to the request logger and panic recovery handler so log lines correlate with traces. Wrap the teranode, merkle-service, webhook callback, and DataHub HTTP clients' transports with otelhttp so outbound calls get client spans and W3C traceparent propagation, preserving each client's existing transport behaviour (broadcast transport, SSRF guard, default transport). Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
Add kafka/otel.go (InjectTraceContext/ExtractTraceContext + headerCarrier) and wire it through the produce and consume paths: Producer.Send/SendAsync/ SendBatch/SendRaw now take ctx as their first parameter and forward it to the broker; saramaBroker injects W3C trace headers and wraps each produce in a "kafka.produce <topic>" span; memoryBroker.publish does the same so standalone mode round-trips traces too. The generic consume path (ConsumerGroup.processOne) extracts the trace context and wraps handler + DLQ routing in a "kafka.consume <topic>" span. The propagation dispatcher stashes each message's extracted producer SpanContext on propagationMsg (unexported, not serialized) and processBatch opens one "propagation.broadcast" span per flushed batch, linking up to 128 producer spans (OTel batch-messaging pattern) rather than one span per message. Every guard mirrors the same zero-cost-when-disabled shape used by the existing HTTP tracing: InjectTraceContext returns nil (no allocation) when ctx carries no valid span, and span creation is skipped entirely on that same check rather than relying only on the no-op tracer, cutting the disabled produce/consume path down to the pre-existing message-construction allocations (bench-verified, see otel_test.go/bench_test.go). Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
Add the logfields package with canonical zap.Field constructors (TxID, TxIDs, TxIDCount, BlockHash, BlockHeight, SubtreeHash, CallbackURL, Status, Stage) plus TxIDBatch and ForEachTxIDChunk helpers for logging txid batches without blowing out a single log line. Migrate all ~47 existing identifier log sites across api_server, propagation, bump_builder, webhook, sse, watchdog, chaintracks_server, merkleservice, and events to the new constructors, fixing the one camelCase outlier (blockHash) and retiring the ad-hoc txid_total/txid_count/txids_sample variants in favor of txid_count/txids. This gives merkle-service and arcade a shared field vocabulary so a txid or block_hash search in Coralogix can correlate log lines across both services. enforce_test.go statically walks the repo via go/ast and fails the build if any zap.String/Strings/Int/Uint64 call outside this package uses one of the canonical (or retired) field-name literals directly, guarding against regression back to ad-hoc field names. Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…archability Close the transaction-lifecycle log gaps so a txid or block_hash search in Coralogix returns the full lifecycle of a transaction, not just the REJECTED terminal states that were already logged. Add Info-level lifecycle lines (via the logfields constructors) at every status transition that was previously silent: - RECEIVED (single and batch submit) in handleSubmitTransaction / handleSubmitTransactions - ACCEPTED_BY_NETWORK in applyTerminalStatuses - SEEN_ON_NETWORK / SEEN_MULTIPLE_NODES in applySeenCallback - MINED in markMinedAndPublish, chunked via logfields.ForEachTxIDChunk so every txid in a block surfaces somewhere in the log stream regardless of block size - PENDING_RETRY in requeueAfterDelay (once per call, not per message) - merkle-service registration in registerBatch, at Debug (not a status transition, keeps Info volume flat) - STUMP stored / BLOCK_PROCESSED enqueued in handleStump / handleBlockProcessed - webhook delivery success raised from Debug to Info (bounded by subscription count, not raw TPS) HTTP and Kafka-message handler entry points derive their logger via telemetry.LoggerWith(ctx, ...) so these lines carry trace_id/span_id when tracing is enabled. Batch-level operations that fan across many unrelated producer traces (propagator batch processing) intentionally keep the plain component logger — attaching one arbitrary trace_id to a multi-trace batch would be misleading. Every batch/chunk line uses TxIDBatch or ForEachTxIDChunk from the logfields package, so no line is unbounded and MINED lines cover every txid without capping. Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…fecycle coverage
Quality-review fixes for Task 4 (both commits above):
logfields / enforcement guard:
- Add "status" to enforce_test bannedKeys so a raw zap.String("status",...)
outside logfields is now caught (Status() is canonical, 7 sites).
- Extend the AST guard to also flag zap.Any/zap.Reflect (untyped escape
hatches) whose first arg is a banned key; document the exact constructor
coverage on zapFieldFuncs. Make the status/Any/Reflect probes permanent,
table-driven subtests in TestScanDetectsViolation.
- Rename two pre-existing HTTP-status-code sites from "status" to the
established "status_code" convention (api_server request logger,
bump/datahub fetch) — resolves the int-vs-string type collision the new
ban surfaced. Update the datahub test accordingly.
- Add logfields.Stage{Intake,Network,Cascade} constants; use them at all
Stage() call sites to prevent typos.
lifecycle coverage vs hot path:
- Batch RECEIVED (sync, pre-response, /txs up to 256MiB): switch from
unbounded ForEachTxIDChunk to a BOUNDED TxIDBatch line (true count +
capped list). Full per-txid coverage still arrives on the async ACCEPTED
line. Comment explains why.
- ACCEPTED_BY_NETWORK and SEEN (both async, off the request path): switch
from bounded TxIDBatch to full-coverage ForEachTxIDChunk, matching MINED.
- "processing batch": drop the 5-item preview that was shipping under the
canonical "txids" key (colliding with full-list semantics); txid_count
conveys size and the full list lives on the ACCEPTED line.
- Add comments at the webhook + propagator lifecycle lines that
intentionally skip telemetry.LoggerWith (no request ctx / batch spans
many unrelated producer traces).
- Strengthen tests: RECEIVED-batch proves bounded (1001 txs -> 1 line);
SEEN proves full chunked coverage (2300 txs -> 3 lines, union covers all);
ACCEPTED pins the chunked path via chunk_total.
Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…insecure on env endpoint Two carried-over fixes from the sibling merkle-service review, applied here before they hit production: 1. config.validate() only rejected an http(s):// scheme in telemetry.endpoint when protocol=grpc. The otlptracehttp/otlpmetrichttp exporters' WithEndpoint also want bare host:port, so an "http://collector:4318" under protocol=http passed validation and then failed silently in the exporter's background goroutine. Now rejected for both protocols, since the endpoint field is always host:port regardless of transport. 2. WithInsecure() was nested inside the `cfg.Endpoint != ""` branch in both the trace and metric exporter builders, so telemetry.insecure=true was silently dropped whenever the endpoint came from OTEL_EXPORTER_OTLP_ENDPOINT instead of cfg.Endpoint. Hoisted so it applies whenever cfg.Insecure is true, independent of which endpoint path is in play. Also guard the global propagator install behind at least one signal pipeline being built (enabled=true with traces=false and metrics=false now leaves the propagator untouched), matching the same review fix in merkle-service. Added a config test enumerating both protocols x both schemes for the endpoint-scheme rejection, and a telemetry test that forces a scheme-less env-supplied endpoint (so the OTLP library's own scheme-based auto-insecure detection can't mask the bug) to prove insecure=true is honored on that path. Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…ult)
Add HOST_IP/POD_NAME downward-API fields plus OTEL_EXPORTER_OTLP_ENDPOINT
and ARCADE_TELEMETRY_{ENABLED,INSECURE} to every mode's container env in
deploy/*.yaml (all, api-server, bump-builder, chaintracks, p2p-client,
propagation, sse, watchdog), pointing at a node-local collector reached
via the pod's host IP. HOST_IP is defined inline (not from a ConfigMap) and
precedes its $(HOST_IP) use, per the downward-API substitution requirement.
Telemetry stays off (ARCADE_TELEMETRY_ENABLED=false) so this is purely
additive plumbing — no behavior change until an operator flips the flag.
These repo manifests are reference/dev deployments; the authoritative
production deploy and OTLP wiring lives in the flux repo
(bsva-infra-flux#227).
Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
…fecycle logs New docs/observability.md covers what actually shipped in the OTel/telemetry work: the OTLP pipeline (node-local collector, traces+metrics pushed, logs staying on stdout->filelog rather than OTLP), the full telemetry.*/OTEL_* config-env reference table, resource attributes (including arcade.mode), the snake_case log-field canon shared with merkle-service (txid/txids/txid_count/ block_hash/block_height/subtree_hash/callback_url/status/stage/trace_id/ span_id) and its static enforcement, the transaction-lifecycle log line inventory per status (bounded RECEIVED vs full-coverage chunked ACCEPTED/ SEEN/MINED, the Debug-level merkle-registration checkpoint, STUMP/ BLOCK_PROCESSED, webhook delivery), a note that IMMUTABLE and store-backed PENDING_RETRY have no active producer today, Coralogix search recipes, the dual Prometheus+OTLP metrics export, and trace/log correlation via trace_id. Cross-linked from README.md and metrics/README.md. Claude-Session: https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Goal
Ship Coralogix observability for arcade so a
txidorblock_hashissearchable end-to-end across the transaction lifecycle. Structured JSON
logging already existed; this canon-normalizes the log fields, adds
trace/span correlation, and layers OTel traces + metrics on top — all off
by default. Addresses #223. Implements the design in
docs/plans/otel-integration.md.What changed
telemetry/package + Prometheus→OTLP bridge: builds the shared OTELresource (
service.name,service.namespace,service.version,service.instance.id,arcade.mode), wires OTLP trace/metric exporters(gRPC or HTTP/protobuf), and bridges the existing
arcade_*Prometheusregistry onto the OTLP metric pipeline via
go.opentelemetry.io/contrib/bridges/prometheus— no metric name/labelchanges,
/metricskeeps scraping unchanged, both exporters runsimultaneously (dual-export, expected not a bug).
otelginon the inbound gin engine (outside panicrecovery, so a panic still lands on a live span);
otelhttptransports onevery outbound
http.Client(Teranode broadcast, merkle-service client,DataHub client, webhook delivery).
small
propagation.TextMapCarrieradapter over the existing header map —preserves the F-024 ("register with merkle-service before broadcast")
durability invariant, and is a genuine zero-allocation no-op when there is
no active span (telemetry disabled), verified by benchmark.
logfieldscanon package: typed constructors for everytransaction/block-lifecycle field (
txid,txids,txid_count,block_hash,block_height,subtree_hash,callback_url,status,stage) — the same field names merkle-service uses, so one Coralogixquery spans both services. A static AST test
(
logfields/enforce_test.go) fails the build if any of these field namesis ever constructed via a raw
zap.String/Strings/Int/Uint64/Any/Reflectcall outside the package, so the canon can't silently regress.
logged — bounded (
TxIDBatch, single line) for the sync pre-responseRECEIVED lines, full-coverage chunked (
ForEachTxIDChunk, guarantees everytxid appears somewhere in the stream) for the async ACCEPTED_BY_NETWORK /
SEEN_ON_NETWORK[_MULTIPLE_NODES] / MINED lines, plus REJECTED (intake /
network / cascade, tagged with
stage), a Debug-level merkle-registrationcheckpoint, STUMP/BLOCK_PROCESSED callback lines, and webhook-delivered.
trace_id/span_idride along wherever the log call has a live sampledspan in context (
telemetry.LoggerWith) — including MINED, whose Kafkaconsumer context is re-hydrated from the producer's injected trace
headers.
zap.RedirectStdLogso libraries logging throughthe standard library's package-global logger (e.g. gocore) also land as
structured JSON instead of leaking unstructured text to stderr.
here before they shipped):
config.validate()now rejects anhttp(s)://-scheme
telemetry.endpointfor both protocols, not justgrpc;
WithInsecure()is hoisted sotelemetry.insecure=trueapplieseven when the endpoint comes from
OTEL_EXPORTER_OTLP_ENDPOINTratherthan
telemetry.endpoint.deploy/*.yaml: every workload manifest (all, api-server,bump-builder, chaintracks, p2p-client, propagation, sse, watchdog) gets
HOST_IP/POD_NAMEdownward-API fields plusOTEL_EXPORTER_OTLP_ENDPOINT/ARCADE_TELEMETRY_{ENABLED,INSECURE}env,pointed at a node-local collector. These repo manifests are
reference/dev deployments — the authoritative production deploy and
OTLP wiring is bsva-infra-flux#227.
docs/observability.md(new): the OTLP pipeline, fulltelemetry.*/OTEL_*config-env reference, resource attributes, thefield canon and lifecycle-log inventory (including which two statuses —
IMMUTABLEand store-backedPENDING_RETRY— have no active producertoday and so are never logged as transitions), Coralogix search recipes,
and trace/log correlation. Cross-linked from
README.mdandmetrics/README.md.Off by default
ARCADE_TELEMETRY_ENABLED(defaultfalse) gates the entire OTel pipeline.With it unset/false,
telemetry.Initbuilds no providers, sets no OTELglobals, and opens no network connections — runtime behaviour is identical
to a build with no OTel support at all. The structured-logging /
logfieldscanon changes are unconditional (always-on, same as existingJSON logging) but are pure logging — no behavior change to any request
path. Kafka trace-header injection is a verified zero-allocation no-op when
disabled.
Companion PRs
Testing
go build ./...,go test ./... -count=1— all green.go test -raceontelemetry,kafka,services/propagation,services/api_server— all green.gofmt -l .— clean.golangci-lint run ./...— 0 issues.go mod tidy— no diff; all OTel core modules pinned tov1.44.0,contrib to
v0.69.0.schemes for the endpoint-scheme rejection; a telemetry test that forces a
scheme-less env-supplied endpoint (so the OTLP library's own
scheme-based auto-insecure detection can't mask the bug) to prove
insecure=trueis honored on that path; existing lifecycle-logging tests(bounded vs. chunked coverage, Debug-level registration, chunk
numbering) from the prior commits in this branch.
telemetry.logs=trueonly emits a one-time startupwarning; logs continue to ship via stdout JSON → filelog receiver, never
over OTLP.
https://claude.ai/code/session_01RjAobHaMvzZuzQF6E7dgBu