feat(ENGHLP-1235): add custom OpenTelemetry spans for ingest and NetBox client#546
feat(ENGHLP-1235): add custom OpenTelemetry spans for ingest and NetBox client#546jajeffries wants to merge 2 commits into
Conversation
…ox client Instrument Redis stream message handling with queue-lag attributes, wrap CreateIngestionLogs validation in a parent span, and trace rate-limiter waits before NetBox bulk-plan HTTP calls. Co-authored-by: Cursor <cursoragent@cursor.com>
Vulnerability Scan: Passed — diode-reconcilerImage: No vulnerabilities found. Commit: 4c59aeb |
Vulnerability Scan: Passed — diode-authImage:
Commit: 4c59aeb |
Vulnerability Scan: Passed — diode-ingesterImage: No vulnerabilities found. Commit: 4c59aeb |
|
Go test coverage
Total coverage: 52.8% |
Upgrade x/net (and transitive x/crypto, x/sys, x/sync, x/text) to resolve HIGH severity CVEs flagged by the container vulnerability scan. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: af16a525bb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err != nil { | ||
| errs = append(errs, fmt.Errorf("failed to convert ingestion timestamp: %v", err)) | ||
| } else { | ||
| streamLag := time.Now().Unix() - int64(ingestionTs) |
There was a problem hiding this comment.
Use nanoseconds when calculating stream lag
In production the ingester writes ingestion_ts as time.Now().UnixNano() (diode-server/ingester/component.go:184), so this calculation mixes seconds from time.Now().Unix() with a nanosecond timestamp. Every normally produced message will report a huge negative stream_lag value instead of queue wait time, making the new span attribute unusable for incident debugging; convert the stored value with time.Unix(0, int64(ingestionTs)) or compare nanoseconds consistently.
Useful? React with 👍 / 👎.
|
|
||
| func (c *Client) waitForRateLimit(ctx context.Context) (err error) { | ||
| ctx, span := telemetry.StartSpan(ctx, telemetry.SpanRateLimiterWait) | ||
| defer telemetry.End(span, err) |
There was a problem hiding this comment.
Defer rate-limit span errors with a closure
Because arguments to a deferred call are evaluated immediately, err is still nil when this defer is registered. If c.limiter.Wait(ctx) returns an error because the request context is canceled or times out while waiting, the client correctly returns that error but the new rate_limiter.wait span is ended as successful, hiding exactly the wait failures this span is meant to expose; wrap the call in defer func() { telemetry.End(span, err) }() as done in the ingestion spans.
Useful? React with 👍 / 👎.
Summary
Adds the upstream custom spans required by diode-pro#318 for ENGHLP-1235 incident debugging: async ingest stream handling, validation-phase work, and explicit rate-limiter wait time before NetBox HTTP calls.
What changed
diode-server/telemetry/tracing.go—StartSpan/Endhelpers and span name constants aligned withdiode-pro/docs/observability/tracing.md.reconciler/ingestion_processor.goingestion.handle_stream_message— parent span withstream_lag,request_id,entity_countingestion.create_ingestion_logs— validation/hashing phase before bulk DB insertnetboxdiodeplugin/client.go—rate_limiter.waitspan viawaitForRateLimit()before BulkPlan, BulkPlanApply, and GetDefaultBranch callsHow tested
make fix-lintgo test ./reconciler/... ./netboxdiodeplugin/... ./telemetry/...Follow-up
After merge, bump
github.com/netboxlabs/diode/diode-serverindiode-pro/server/go.mod(companion PR: https://github.com/netboxlabs/diode-pro/pull/318).Linear
ENGHLP-1235 — https://linear.app/netboxlabs/issue/ENGHLP-1235/add-tracing-to-help-understand-request-waits
Made with Cursor