Fix observability metrics and error handling in Graph API client #458

Merged
jfrench9 merged 3 commits into main from bugfix/graph-api-query-o11y-fix
Mar 11, 2026

jfrench9 commented Mar 11, 2026

Summary

This PR enhances the observability and error handling in the GraphClient.stream_chunks method, improving the reliability and debuggability of Graph API query operations.

Key Accomplishments

Graph API Client (robosystems/graph_api/client/client.py)

  • Improved error handling: Refactored the stream_chunks method to provide more granular and robust error handling, ensuring failures are properly caught, categorized, and surfaced
  • Enhanced metrics tracking: Added comprehensive metrics instrumentation around the streaming flow, enabling better observability into query performance, error rates, and operational health
  • Structured error propagation: Errors are now tracked with richer context, making it easier to diagnose issues in production through distributed tracing and metric dashboards
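
The diff itself is not reproduced on this page, so the following is only a minimal sketch of how failures might be categorized, assuming the client / server / timeout split named in the commit notes. The category strings, the `classify_error` helper, and the in-memory counter are all hypothetical stand-ins, not the names used in `client.py`.

```python
from collections import Counter
from typing import Optional

# Hypothetical category names; the PR does not show the real label values.
CLIENT_ERROR = "client_error"
SERVER_ERROR = "server_error"
TIMEOUT = "timeout"


def classify_error(
    status_code: Optional[int] = None,
    exc: Optional[BaseException] = None,
) -> Optional[str]:
    """Map a failed request to a coarse error category, or None on success."""
    if isinstance(exc, TimeoutError):
        return TIMEOUT
    if status_code is not None:
        # 408 (Request Timeout) and 504 (Gateway Timeout) are counted as timeouts.
        if status_code in (408, 504):
            return TIMEOUT
        if 400 <= status_code < 500:
            return CLIENT_ERROR
        if 500 <= status_code < 600:
            return SERVER_ERROR
    if exc is not None:
        # Assumption: an unknown exception with no status is treated as server-side.
        return SERVER_ERROR
    return None


# Stand-in for a metrics counter keyed by error category.
error_counts: Counter = Counter()

for status in (200, 404, 503, 504):
    category = classify_error(status_code=status)
    if category is not None:
        error_counts[category] += 1
```

Keeping the category set small and fixed means the same values can be reused as a metric label without growing label cardinality.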

OTel Metrics (robosystems/middleware/otel/metrics.py)

  • Extended metrics definitions: Added new metric instruments or labels to support the enhanced tracking in the Graph API client, ensuring consistent metric semantics across the platform

Breaking Changes

None. The changes are additive improvements to error handling and metrics instrumentation. Existing API contracts and method signatures are preserved.

Testing Notes

  • Verify that the stream_chunks method correctly emits metrics on both successful and failed query operations
  • Confirm that error scenarios (e.g., network failures, malformed responses, timeouts) are properly captured in metrics and do not result in silent failures or unhandled exceptions
  • Validate that OpenTelemetry metric exports reflect the new/updated metric dimensions
  • Regression test existing Graph API query flows to ensure no behavioral changes in the happy path
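
The first two checks above could be automated roughly as follows. This is a sketch under stated assumptions: `RecordingMetrics` and `instrumented_stream` are hypothetical stand-ins for the real metrics layer and for `stream_chunks`, and the metric name and labels are invented for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


class RecordingMetrics:
    """In-memory stand-in for a metrics backend (hypothetical)."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, Dict[str, str]]] = []

    def record(self, name: str, **labels: str) -> None:
        self.events.append((name, labels))


def instrumented_stream(
    source: Callable[[], Iterable[bytes]],
    metrics: RecordingMetrics,
) -> Iterator[bytes]:
    """Simplified stand-in for stream_chunks: emit one outcome metric per stream."""
    try:
        for chunk in source():
            yield chunk
    except TimeoutError:
        # Record the failure, then re-raise so the caller still sees it.
        metrics.record("stream_result", status="error", error_type="timeout")
        raise
    else:
        metrics.record("stream_result", status="success")


def test_success_emits_metric() -> None:
    metrics = RecordingMetrics()
    assert list(instrumented_stream(lambda: [b"a", b"b"], metrics)) == [b"a", b"b"]
    assert metrics.events == [("stream_result", {"status": "success"})]


def test_timeout_is_recorded_and_reraised() -> None:
    def failing_source() -> Iterator[bytes]:
        yield b"a"
        raise TimeoutError("simulated")

    metrics = RecordingMetrics()
    received = []
    try:
        for chunk in instrumented_stream(failing_source, metrics):
            received.append(chunk)
    except TimeoutError:
        pass
    else:
        raise AssertionError("timeout was swallowed")
    assert received == [b"a"]
    assert metrics.events == [
        ("stream_result", {"status": "error", "error_type": "timeout"})
    ]
```

The second test asserts both halves of the requirement: the error reaches the caller, and exactly one failure metric is recorded for it.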

Infrastructure Considerations

  • Ensure that the OpenTelemetry collector and any downstream metrics backends (e.g., Prometheus, Grafana) are configured to handle the new metric names or label cardinality introduced by this change
  • Monitoring dashboards may need to be updated to visualize the newly emitted metrics for full observability coverage
  • No deployment configuration changes are expected; the changes are purely at the application instrumentation layer
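
On the label cardinality point, one common way to keep new dimensions bounded is to normalize label values before recording them. The sketch below assumes hypothetical label names (`status_class`, `error_type`) and a hypothetical allow-list; the PR does not enumerate the real ones.

```python
from typing import Dict

# Hypothetical allow-list; the PR does not enumerate the real label values.
ALLOWED_ERROR_TYPES = frozenset({"client_error", "server_error", "timeout"})


def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 404 -> '4xx'."""
    if 100 <= status_code <= 599:
        return f"{status_code // 100}xx"
    return "unknown"


def bounded_labels(status_code: int, error_type: str) -> Dict[str, str]:
    """Build metric labels whose value sets are fixed and small."""
    return {
        "status_class": status_class(status_code),
        # Anything outside the allow-list (e.g. a raw exception message)
        # is folded into a single bucket instead of creating a new series.
        "error_type": error_type if error_type in ALLOWED_ERROR_TYPES else "other",
    }
```

With this shape, the number of time series per metric is capped by the product of the two small value sets, regardless of what error strings appear at runtime.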

🤖 Generated with Claude Code

Branch Info:

  • Source: bugfix/graph-api-query-o11y-fix
  • Target: main
  • Type: bugfix

Co-Authored-By: Claude <noreply@anthropic.com>

…nks method

- Improved error handling to classify response errors as client, server, or timeout.
- Added metrics for time-to-first-byte and error types during streaming operations.
- Updated NDJSON line parsing with enhanced logging for JSON decode errors.
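
To illustrate the time-to-first-byte and NDJSON points in this commit, here is a minimal sketch using only the standard library. The function name `parse_ndjson`, the `on_first_byte` callback, and the logger name are hypothetical; the actual implementation in `stream_chunks` is not shown on this page and may differ.

```python
import json
import logging
import time
from typing import Any, Callable, Iterable, Iterator, Optional

logger = logging.getLogger("graph_api.stream_sketch")


def parse_ndjson(
    lines: Iterable[bytes],
    on_first_byte: Optional[Callable[[float], None]] = None,
) -> Iterator[Any]:
    """Yield one decoded value per NDJSON line; log and skip malformed lines."""
    start = time.monotonic()
    seen_first = False
    for line_no, raw in enumerate(lines, start=1):
        if not seen_first:
            seen_first = True
            if on_first_byte is not None:
                # Seconds between starting to read and the first line arriving.
                on_first_byte(time.monotonic() - start)
        text = raw.decode("utf-8", errors="replace").strip()
        if not text:
            continue  # blank lines carry no data
        try:
            yield json.loads(text)
        except json.JSONDecodeError as err:
            logger.warning(
                "skipping malformed NDJSON line %d: %s (%r)",
                line_no, err.msg, text[:80],
            )


ttfb_samples = []
chunks = list(
    parse_ndjson(
        [b'{"row": 1}\n', b"not json\n", b"\n", b'{"row": 2}\n'],
        on_first_byte=ttfb_samples.append,
    )
)
```

In this run `chunks` holds the two valid rows and the malformed line is logged rather than aborting the stream. Whether a bad line should be skipped or should fail the whole query is a design choice the PR text does not settle; skipping is assumed here.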
…hunks method

- Simplified error handling by removing unnecessary variables for stream status and error type.
- Enhanced metrics tracking to ensure accurate recording of errors and response statuses.
- Improved clarity in exception handling for timeout and transient errors.
jfrench9 merged commit 3e7a0be into main Mar 11, 2026
7 checks passed
jfrench9 deleted the bugfix/graph-api-query-o11y-fix branch March 11, 2026 05:54