
fix: prevent gunicorn worker recycling from corrupting histogram aggregation#1068

Open
dkliban wants to merge 1 commit into pulp:main from dkliban:fix-otel-histogram-counter-resets

Conversation


@dkliban dkliban commented Apr 20, 2026

Problem

Gunicorn worker recycling (--max-requests) resets in-memory counters to 0. The OTel pipeline strips worker.name and sums all workers into a single cumulative counter via groupbyattrs. When a worker recycles, the aggregate can decrease.

This causes a "hidden counter reset": if the recycled worker's final le=+Inf bucket value coincidentally equals the new worker's starting value (e.g. both are 1 because the new worker immediately handled a slow request before the first scrape), Prometheus does not detect the reset for le=+Inf. But le=1000 resets visibly (new worker starts at 0, no fast requests yet). This inflates rate(le=1000) relative to rate(le=+Inf), producing SLI latency ratios greater than 1.
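The mismatch can be reproduced with a few lines of Python modelling Prometheus's per-series reset detection (the sample values below are invented for illustration, not taken from production):

```python
# Illustrative simulation of Prometheus-style counter-reset correction.
# A decrease between scrapes is treated as a reset, so the post-reset value
# is added in full; an unchanged value looks like "no traffic".

def corrected_increase(samples):
    """Increase over a scrape window with reset correction applied."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total

# Worker-summed bucket values at three scrapes, around a recycle:
le_1000 = [5, 2, 2]   # drops 5 -> 2: reset IS detected, correction adds 2
le_inf  = [6, 6, 7]   # old worker's final value equals the new worker's
                      # value at the next scrape: the reset is hidden

fast = corrected_increase(le_1000)    # 2
total = corrected_increase(le_inf)    # 1 (the new worker's pre-scrape
                                      # requests are silently lost)
print(fast / total)                   # 2.0 -- an SLI ratio above 1
```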

Fix

Set OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta on pulp-api, pulp-content, and pulp-worker. With delta temporality, the SDK exports only the change since the last interval rather than a running total. When a worker recycles, its first export is a small non-negative delta — not a drop from a large cumulative value. The groupbyattrs aggregation then sums non-negative per-worker deltas, and the Prometheus exporter accumulates them into a monotonically increasing cumulative counter.

This approach requires no new OTel collector processors and preserves the existing cardinality reduction from attributes/remove_worker_name + groupbyattrs/api_aggregation.
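In a Clowder/OpenShift-style manifest this amounts to one env entry per deployment; the snippet below is a sketch of the shape only, not the literal diff in deploy/clowdapp.yaml:

```yaml
# Added to each of the pulp-api, pulp-content, and pulp-worker pod templates
env:
  - name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
    value: delta
```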

Verification

Tested in an ephemeral environment. The OTel collector starts cleanly and pulp_api_request_duration_milliseconds_bucket is exported without a worker_name label, confirming the aggregation pipeline is functioning correctly.

Test plan

  • Deploy to stage
  • Confirm the SLI query returns ≤ 1 with a 1h window:
    sum(rate(pulp_api_request_duration_milliseconds_bucket{le="1000",exported_job="pulp-api",http_method=~"PUT|POST|PATCH",http_target!~".*/upload/.*"}[1h]))
    /
    sum(rate(pulp_api_request_duration_milliseconds_bucket{le="+Inf",exported_job="pulp-api",http_method=~"PUT|POST|PATCH",http_target!~".*/upload/.*"}[1h]))
    
  • Confirm no per-series ratios return +Inf:
    rate(pulp_api_request_duration_milliseconds_bucket{le="1000",...}[1h])
    / ignoring(le)
    rate(pulp_api_request_duration_milliseconds_bucket{le="+Inf",...}[1h])
    
  • Confirm raw counter sums remain monotonically increasing
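The last bullet can be checked directly with PromQL's resets(), which counts per-series decreases within the window. After the fix, a sketch like this should return no series (legitimate collector restarts would also surface here, so any hits need a second look):

```promql
resets(pulp_api_request_duration_milliseconds_bucket{exported_job="pulp-api"}[1h]) > 0
```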

🤖 Generated with Claude Code


sourcery-ai Bot commented Apr 20, 2026

Reviewer's Guide

Adjusts the OTel metrics aggregation pipeline to convert per-worker cumulative counters to deltas before aggregation, and also adds a full dev-container image, a CI workflow, and Alcove automation for dependency upgrades and local development.

Sequence diagram for updated OTel metrics aggregation pipeline

sequenceDiagram
    participant GunicornWorker as Gunicorn_worker
    participant OTelReceiver as OTel_metrics_receiver
    participant CumulativeToDelta as cumulativetodelta
    participant RemoveWorkerName as attributes_remove_worker_name
    participant Batch as batch_api_aggregation
    participant GroupByAttrs as groupbyattrs_api_aggregation
    participant DeltaToCumulative as deltatorumulative
    participant Prometheus as prometheus_exporter

    GunicornWorker->>OTelReceiver: emit api.request_duration (cumulative, per worker)
    OTelReceiver->>CumulativeToDelta: api.request_duration (cumulative)
    CumulativeToDelta-->>OTelReceiver: api.request_duration (delta, per worker)

    OTelReceiver->>RemoveWorkerName: api.request_duration (delta, per worker)
    RemoveWorkerName-->>Batch: api.request_duration (delta, worker.name removed)
    Batch-->>GroupByAttrs: api.request_duration (delta, grouped by api attrs)

    GroupByAttrs-->>DeltaToCumulative: api.request_duration (aggregated delta)
    DeltaToCumulative-->>Prometheus: api.request_duration (aggregated cumulative)

    Note over GunicornWorker,CumulativeToDelta: Worker recycle resets its local counter to 0
    Note over CumulativeToDelta,GroupByAttrs: Reset becomes 0-delta, preventing negative aggregate
    Note over DeltaToCumulative,Prometheus: Prometheus sees a clean cumulative counter stream

File-Level Changes

Change Details Files
Fix histogram aggregation under Gunicorn worker recycling by changing the OTel metrics pipeline to aggregate deltas instead of cumulative counters and re-expose the result as cumulative to Prometheus.
  • Define a cumulativetodelta processor that targets the api.request_duration metric with strict matching.
  • Insert cumulativetodelta into the metrics/aggregation pipeline before removing worker.name to convert per-worker cumulative counters to deltas.
  • Insert deltatorumulative after groupbyattrs/api_aggregation to turn the aggregated delta stream back into a cumulative counter for export.
  • Ensure the modified pipeline still uses the existing memory_limiter, filter, batch, and groupby processors in the correct order.
deploy/clowdapp.yaml
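Assembled from the bullets above, the collector configuration would look roughly like this (an illustration, not the PR's literal diff; note that the contrib processor is named `deltatocumulative`, and Sourcery's Comment 1 below flags the PR's `deltatorumulative` spelling as a likely typo):

```yaml
processors:
  cumulativetodelta:
    include:
      metrics:
        - api.request_duration
      match_type: strict
  # contrib processor name; the PR's "deltatorumulative" spelling
  # would fail to instantiate
  deltatocumulative: {}

service:
  pipelines:
    metrics/aggregation:
      processors:
        - memory_limiter
        - filter
        - cumulativetodelta
        - attributes/remove_worker_name
        - batch/api_aggregation
        - groupbyattrs/api_aggregation
        - deltatocumulative
```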
Introduce a dedicated dev container image to run pulp-service and its dependencies locally with supervisord-managed services.
  • Create a dev-container Dockerfile that installs Python, PostgreSQL 16, Redis, pulp-service, and support tooling into a venv, applies bundled patches, and configures a non-production Pulp setup.
  • Initialize and configure PostgreSQL and Redis inside the image, including trust-based pg_hba.conf, database/user creation, and data directories.
  • Wire in existing image assets (scripts, middleware, route helpers, patches) and collect static assets as the pulp user.
  • Expose Pulp API and content ports and declare /workspace as a shared volume for editable checkouts.
dev-container/Dockerfile
Add a supervised runtime entrypoint that bootstraps the dev environment and manages core services.
  • Implement an entrypoint script that starts PostgreSQL and Redis, creates the pulp database/user if missing, and runs Django migrations.
  • Optionally install pulp-service from /workspace/pulp-service in editable mode when present to support live code reloading.
  • Reset the admin password to a configurable default and then stop the ad-hoc DB/Redis processes so supervisord can take over.
  • Start supervisord using a custom config that manages PostgreSQL, Redis, and the Pulp API/content/worker processes with logs under /var/log/pulp.
dev-container/entrypoint.sh
dev-container/supervisord.conf
Document and automate usage of the dev container and add an Alcove-based dependency upgrade workflow.
  • Add .CLAUDE.md describing how to run and use the hosted-pulp-dev-env image, manage services, apply patches, run tests, and interact with the database.
  • Define an Alcove agent (upgrade-deps/AGENT.md) with a multi-phase procedure for upgrading pulpcore/plugin dependencies, handling patches, migrations, and tests, then opening PRs.
  • Add an Alcove task and workflow for scheduling and executing the dependency upgrade pipeline using the dev container image.
  • Provide a dev-focused Django settings module that connects to local PostgreSQL/Redis and enables domain support with token auth disabled.
.CLAUDE.md
.alcove/agents/upgrade-deps/AGENT.md
.alcove/tasks/upgrade-deps.yml
.alcove/workflows/upgrade-deps-pipeline.yml
dev-container/settings.py
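A rough idea of what such a dev-only settings module could contain; every value here is an assumption for illustration, not the contents of the actual dev-container/settings.py:

```python
# Hypothetical dev-only Django settings for Pulp: local PostgreSQL/Redis,
# multi-domain support on, container token auth off. Not the PR's file.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "pulp",
        "USER": "pulp",
        "HOST": "localhost",
        "PORT": 5432,
    }
}
REDIS_HOST = "localhost"
REDIS_PORT = 6379
DOMAIN_ENABLED = True       # pulpcore multi-domain support
TOKEN_AUTH_DISABLED = True  # pulp_container token auth disabled for local use
```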
Set up CI to build and publish the dev container image to GitHub Container Registry on each branch push.
  • Add a GitHub Actions workflow that builds dev-container/Dockerfile using buildx on every branch push.
  • Sanitize the branch name by replacing slashes with dashes and use it as the Docker tag.
  • Log in to GHCR with GITHUB_TOKEN and push images to ghcr.io/<owner>/hosted-pulp-dev-env:<tag>.
  • Enable GHA cache for faster subsequent builds.
.github/workflows/build-dev-container.yml
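The described workflow could be sketched as follows (step names and action versions are assumptions; the real file is .github/workflows/build-dev-container.yml):

```yaml
name: build-dev-container
on:
  push:
    branches: ["**"]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - name: Derive Docker tag from branch name
        run: echo "TAG=${GITHUB_REF_NAME//\//-}" >> "$GITHUB_ENV"
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          file: dev-container/Dockerfile
          push: true
          tags: ghcr.io/${{ github.repository_owner }}/hosted-pulp-dev-env:${{ env.TAG }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
```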
Integrate helper scripts for patch management, service restart, and test execution in the dev container.
  • Introduce placeholder or new scripts under dev-container/scripts for adding/removing patches, applying all patches, restarting Pulp services, and running the test suite.
  • Ensure these scripts are installed into /usr/local/bin via the Dockerfile and marked executable for use inside the container.
dev-container/scripts/pulp-add-patch
dev-container/scripts/pulp-apply-all-patches
dev-container/scripts/pulp-remove-patch
dev-container/scripts/pulp-restart
dev-container/scripts/pulp-test


@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In the metrics/aggregation pipeline, deltatorumulative is currently unscoped and will affect all metrics; consider adding an include filter similar to cumulativetodelta so only api.request_duration is converted back to cumulative.
  • The dev container Dockerfile applies patches but only logs a warning on failure; if these patches are required for correctness, it would be safer to fail the build (or gate it behind an explicit opt-out) instead of proceeding with a partially patched environment.
  • In dev-container/entrypoint.sh, the pip install -e /workspace/pulp-service/pulp_service || true hides installation errors for the editable checkout; consider surfacing or failing on these errors so a misconfigured or missing workspace repo doesn’t silently degrade the dev environment.
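A minimal sketch of the change the last bullet suggests, using the path from the review (hypothetical structure; the real script is dev-container/entrypoint.sh):

```shell
#!/bin/sh
set -eu
# Hypothetical guard: only attempt the editable install when the checkout
# exists, and fail loudly instead of masking errors with `|| true`.
SRC=/workspace/pulp-service/pulp_service
if [ -d "$SRC" ]; then
    pip install -e "$SRC" || {
        echo "editable install of $SRC failed" >&2
        exit 1
    }
    echo "installed editable checkout from $SRC"
else
    echo "no editable checkout at $SRC; using the image's baked-in install"
fi
```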
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the `metrics/aggregation` pipeline, `deltatorumulative` is currently unscoped and will affect all metrics; consider adding an `include` filter similar to `cumulativetodelta` so only `api.request_duration` is converted back to cumulative.
- The dev container Dockerfile applies patches but only logs a warning on failure; if these patches are required for correctness, it would be safer to fail the build (or gate it behind an explicit opt-out) instead of proceeding with a partially patched environment.
- In `dev-container/entrypoint.sh`, the `pip install -e /workspace/pulp-service/pulp_service || true` hides installation errors for the editable checkout; consider surfacing or failing on these errors so a misconfigured or missing workspace repo doesn’t silently degrade the dev environment.

## Individual Comments

### Comment 1
<location path="deploy/clowdapp.yaml" line_range="52-56" />
<code_context>
+              - api.request_duration
+            match_type: strict
+
+        deltatorumulative:
+
         batch:
</code_context>
<issue_to_address>
**issue (bug_risk):** Processor name `deltatorumulative` looks like a typo and may not match the actual OpenTelemetry processor name.

This name doesn’t match the usual OTEL `deltatocumulative` processor and may prevent the processor from being instantiated, breaking delta→cumulative conversion for these metrics. Please verify the intended processor and update both its definition and references to use the correct name.
</issue_to_address>


Comment thread on deploy/clowdapp.yaml (outdated)
@dkliban force-pushed the fix-otel-histogram-counter-resets branch 2 times, most recently from 5c996e0 to d57cf4c on April 20, 2026 at 16:38

dkliban commented Apr 20, 2026

/retest

1 similar comment

dkliban commented Apr 20, 2026

/retest

…egation

Gunicorn worker recycling causes in-memory Prometheus counters to reset.
The OTel aggregation pipeline strips worker.name and sums all workers
into a single cumulative counter via groupbyattrs. When a worker recycles,
its counter resets to 0, decreasing the aggregate.

This manifests as a "hidden counter reset" in Prometheus: if the recycled
worker's final le=+Inf value coincidentally equals the new worker's
starting value (e.g. both are 1 because the new worker immediately handled
a slow request), Prometheus does not detect the reset for le=+Inf. But
le=1000 resets visibly. This inflates rate(le=1000) relative to
rate(le=+Inf), producing SLI ratios greater than 1.

Fix: insert cumulativetodelta before worker aggregation so we sum
per-worker deltas (always non-negative) instead of cumulative totals.
Worker recycles produce a 0-delta rather than a negative value that
corrupts the aggregate. Add deltatorumulative after groupbyattrs to
convert the aggregate delta back to a cumulative counter for the
Prometheus exporter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dkliban force-pushed the fix-otel-histogram-counter-resets branch from d57cf4c to 7e1f670 on April 21, 2026 at 01:13