Cross-service trace correlation during investigation #9

@nomadicmehul

Summary

When investigating an incident, the agent should follow distributed traces across services to distinguish where a failure originates from where it manifests.

Why This Matters

A 5xx error on the API gateway might be caused by database connection pool exhaustion three hops away. Without following the trace, the agent sees only the symptom, not the cause.

Acceptance Criteria

  • Given an error in service-A, fetch the trace ID from logs or other observability data
  • Follow the trace across all services it touched
  • Identify the first failure point in the trace (origin of the error)
  • Include trace evidence in RCA: "Error originated in service-D (DB timeout), propagated through C → B → A"
  • Works with Grafana (Tempo), Datadog APM, or Cloud Trace
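
A rough sketch of how the criteria above could fit together, assuming spans have already been fetched from the tracing backend and normalized into a common shape. The `Span` record, the `trace_id=<32 hex chars>` log format, and all function names are hypothetical and not part of any existing integration; the backend-specific fetch for Tempo, Datadog APM, or Cloud Trace is deliberately left out. The origin heuristic used here is "the failing span with no failing child, earliest first", which is one reasonable reading of "first failure point".

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: logs carry the trace ID as trace_id=<32 lowercase hex chars>.
TRACE_ID_RE = re.compile(r"\btrace_id=([0-9a-f]{32})\b")


@dataclass(frozen=True)
class Span:
    """Hypothetical backend-neutral span record."""
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ns: int
    error: bool
    message: str = ""


def extract_trace_id(log_line: str) -> Optional[str]:
    """Pull the trace ID out of a single log line, if present."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(1) if match else None


def find_origin(spans: list[Span]) -> Optional[Span]:
    """Return the failing span that has no failing child (earliest if several)."""
    errored = [s for s in spans if s.error]
    parents_of_errors = {s.parent_id for s in errored}
    leaves = [s for s in errored if s.span_id not in parents_of_errors]
    return min(leaves, key=lambda s: s.start_ns) if leaves else None


def propagation_path(spans: list[Span], origin: Span) -> list[str]:
    """Walk from the origin up to the root, listing the services passed through."""
    by_id = {s.span_id: s for s in spans}
    path: list[str] = []
    seen: set[str] = set()  # guards against malformed, cyclic parent links
    node = by_id.get(origin.parent_id)
    while node is not None and node.span_id not in seen:
        seen.add(node.span_id)
        if node.service != origin.service and (not path or path[-1] != node.service):
            path.append(node.service)
        node = by_id.get(node.parent_id)
    return path


def summarize(spans: list[Span]) -> Optional[str]:
    """Build the one-line trace evidence for the RCA, or None if nothing failed."""
    origin = find_origin(spans)
    if origin is None:
        return None
    summary = f"Error originated in {origin.service} ({origin.message})"
    path = propagation_path(spans, origin)
    if path:
        summary += ", propagated through " + " → ".join(path)
    return summary
```

For a trace where service-A calls B, B calls C, C calls D, and every span is marked as an error with D reporting "DB timeout", `summarize` would pick D as the origin and list C, B, A as the propagation path. Traces with fan-out or retries may have several failing leaves, so a real implementation would likely need to report more than one candidate.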

Dependencies

  • Requires Phase 1 observability integrations

Metadata

Assignees

No one assigned

    Labels

    • area:observability: Observability tool integrations
    • area:rca: Root cause analysis
    • phase:2-smart-triage: Phase 2 — Smart Triage & Investigation
    • priority:high: Important, do if time allows
    • type:feature: New feature or capability
