fix: prevent gunicorn worker recycling from corrupting histogram aggregation#1068
Open
fix: prevent gunicorn worker recycling from corrupting histogram aggregation#1068
Conversation
Contributor
Reviewer's GuideAdjusts the OTel metrics aggregation pipeline to convert per‑worker cumulative counters to deltas before aggregation, while also adding a full dev-container image, CI workflow, and Alcove automation for dependency upgrade workflows and local development. Sequence diagram for updated OTel metrics aggregation pipelinesequenceDiagram
participant GunicornWorker as Gunicorn_worker
participant OTelReceiver as OTel_metrics_receiver
participant CumulativeToDelta as cumulativetodelta
participant RemoveWorkerName as attributes_remove_worker_name
participant Batch as batch_api_aggregation
participant GroupByAttrs as groupbyattrs_api_aggregation
participant DeltaToCumulative as deltatorumulative
participant Prometheus as prometheus_exporter
GunicornWorker->>OTelReceiver: emit api.request_duration (cumulative, per worker)
OTelReceiver->>CumulativeToDelta: api.request_duration (cumulative)
CumulativeToDelta-->>OTelReceiver: api.request_duration (delta, per worker)
OTelReceiver->>RemoveWorkerName: api.request_duration (delta, per worker)
RemoveWorkerName-->>Batch: api.request_duration (delta, worker.name removed)
Batch-->>GroupByAttrs: api.request_duration (delta, grouped by api attrs)
GroupByAttrs-->>DeltaToCumulative: api.request_duration (aggregated delta)
DeltaToCumulative-->>Prometheus: api.request_duration (aggregated cumulative)
Note over GunicornWorker,CumulativeToDelta: Worker recycle resets its local counter to 0
Note over CumulativeToDelta,GroupByAttrs: Reset becomes 0-delta, preventing negative aggregate
Note over DeltaToCumulative,Prometheus: Prometheus sees a clean cumulative counter stream
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In the
metrics/aggregationpipeline,deltatorumulativeis currently unscoped and will affect all metrics; consider adding anincludefilter similar tocumulativetodeltaso onlyapi.request_durationis converted back to cumulative. - The dev container Dockerfile applies patches but only logs a warning on failure; if these patches are required for correctness, it would be safer to fail the build (or gate it behind an explicit opt-out) instead of proceeding with a partially patched environment.
- In
dev-container/entrypoint.sh, thepip install -e /workspace/pulp-service/pulp_service || truehides installation errors for the editable checkout; consider surfacing or failing on these errors so a misconfigured or missing workspace repo doesn’t silently degrade the dev environment.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the `metrics/aggregation` pipeline, `deltatorumulative` is currently unscoped and will affect all metrics; consider adding an `include` filter similar to `cumulativetodelta` so only `api.request_duration` is converted back to cumulative.
- The dev container Dockerfile applies patches but only logs a warning on failure; if these patches are required for correctness, it would be safer to fail the build (or gate it behind an explicit opt-out) instead of proceeding with a partially patched environment.
- In `dev-container/entrypoint.sh`, the `pip install -e /workspace/pulp-service/pulp_service || true` hides installation errors for the editable checkout; consider surfacing or failing on these errors so a misconfigured or missing workspace repo doesn’t silently degrade the dev environment.
## Individual Comments
### Comment 1
<location path="deploy/clowdapp.yaml" line_range="52-56" />
<code_context>
+ - api.request_duration
+ match_type: strict
+
+ deltatorumulative:
+
batch:
</code_context>
<issue_to_address>
**issue (bug_risk):** Processor name `deltatorumulative` looks like a typo and may not match the actual OpenTelemetry processor name.
This name doesn’t match the usual OTEL `deltatocumulative` processor and may prevent the processor from being instantiated, breaking delta→cumulative conversion for these metrics. Please verify the intended processor and update both its definition and references to use the correct name.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
5c996e0 to
d57cf4c
Compare
Member
Author
|
/retest |
1 similar comment
Member
Author
|
/retest |
…egation Gunicorn worker recycling causes in-memory Prometheus counters to reset. The OTel aggregation pipeline strips worker.name and sums all workers into a single cumulative counter via groupbyattrs. When a worker recycles, its counter resets to 0, decreasing the aggregate. This manifests as a "hidden counter reset" in Prometheus: if the recycled worker's final le=+Inf value coincidentally equals the new worker's starting value (e.g. both are 1 because the new worker immediately handled a slow request), Prometheus does not detect the reset for le=+Inf. But le=1000 resets visibly. This inflates rate(le=1000) relative to rate(le=+Inf), producing SLI ratios greater than 1. Fix: insert cumulativetodelta before worker aggregation so we sum per-worker deltas (always non-negative) instead of cumulative totals. Worker recycles produce a 0-delta rather than a negative value that corrupts the aggregate. Add deltatorumulative after groupbyattrs to convert the aggregate delta back to a cumulative counter for the Prometheus exporter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
d57cf4c to
7e1f670
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Gunicorn worker recycling (
--max-requests) resets in-memory counters to 0. The OTel pipeline stripsworker.nameand sums all workers into a single cumulative counter viagroupbyattrs. When a worker recycles, the aggregate can decrease.This causes a "hidden counter reset": if the recycled worker's final
le=+Infbucket value coincidentally equals the new worker's starting value (e.g. both are1because the new worker immediately handled a slow request before the first scrape), Prometheus does not detect the reset forle=+Inf. Butle=1000resets visibly (new worker starts at 0, no fast requests yet). This inflatesrate(le=1000)relative torate(le=+Inf), producing SLI latency ratios greater than 1.Fix
Set
OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=deltaon pulp-api, pulp-content, and pulp-worker. With delta temporality, the SDK exports only the change since the last interval rather than a running total. When a worker recycles, its first export is a small non-negative delta — not a drop from a large cumulative value. Thegroupbyattrsaggregation then sums non-negative per-worker deltas, and the Prometheus exporter accumulates them into a monotonically increasing cumulative counter.This approach requires no new OTel collector processors and preserves the existing cardinality reduction from
attributes/remove_worker_name+groupbyattrs/api_aggregation.Verification
Tested in an ephemeral environment. The OTel collector starts cleanly and
pulp_api_request_duration_milliseconds_bucketis exported without aworker_namelabel, confirming the aggregation pipeline is functioning correctly.Test plan
+Inf:🤖 Generated with Claude Code