
fix: prevent gunicorn worker recycling from corrupting histogram aggregation#1068

Open
dkliban wants to merge 1 commit into pulp:main from dkliban:fix-otel-histogram-counter-resets

Conversation


@dkliban dkliban commented Apr 20, 2026

Problem

Gunicorn worker recycling (--max-requests) resets in-memory counters to 0. The OTel pipeline strips worker.name and sums all workers into a single cumulative counter via groupbyattrs. When a worker recycles, the aggregate can decrease.

This causes a "hidden counter reset": if the recycled worker's final le=+Inf bucket value coincidentally equals the new worker's starting value (e.g. both are 1 because the new worker immediately handled a slow request before the first scrape), Prometheus does not detect the reset for le=+Inf. But le=1000 resets visibly (new worker starts at 0, no fast requests yet). This inflates rate(le=1000) relative to rate(le=+Inf), producing SLI latency ratios greater than 1.
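The mismatch can be reproduced with a few lines of Python modelling Prometheus's per-series reset detection (the sample values below are invented for illustration, not taken from production):

```python
# Illustrative simulation of Prometheus-style counter-reset correction.
# A decrease between scrapes is treated as a reset, so the post-reset value
# is added in full; an unchanged value looks like "no traffic".

def corrected_increase(samples):
    """Increase over a scrape window with reset correction applied."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total

# Worker-summed bucket values at three scrapes, around a recycle:
le_1000 = [5, 2, 2]   # drops 5 -> 2: reset IS detected, correction adds 2
le_inf  = [6, 6, 7]   # old worker's final value equals the new worker's
                      # value at the next scrape: the reset is hidden

fast = corrected_increase(le_1000)    # 2
total = corrected_increase(le_inf)    # 1 (the new worker's pre-scrape
                                      # requests are silently lost)
print(fast / total)                   # 2.0 -- an SLI ratio above 1
```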

Fix

Set OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta on pulp-api, pulp-content, and pulp-worker. With delta temporality, the SDK exports only the change since the last interval rather than a running total. When a worker recycles, its first export is a small non-negative delta — not a drop from a large cumulative value. The groupbyattrs aggregation then sums non-negative per-worker deltas, and the Prometheus exporter accumulates them into a monotonically increasing cumulative counter.

This approach requires no new OTel collector processors and preserves the existing cardinality reduction from attributes/remove_worker_name + groupbyattrs/api_aggregation.
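In a Clowder/OpenShift-style manifest this amounts to one env entry per deployment; the snippet below is a sketch of the shape only, not the literal diff in deploy/clowdapp.yaml:

```yaml
# Added to each of the pulp-api, pulp-content, and pulp-worker pod templates
env:
  - name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
    value: delta
```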

Verification

Tested in an ephemeral environment. The OTel collector starts cleanly and pulp_api_request_duration_milliseconds_bucket is exported without a worker_name label, confirming the aggregation pipeline is functioning correctly.

Test plan

  • Deploy to stage
  • Confirm the SLI query returns ≤ 1 with a 1h window:
    sum(rate(pulp_api_request_duration_milliseconds_bucket{le="1000",exported_job="pulp-api",http_method=~"PUT|POST|PATCH",http_target!~".*/upload/.*"}[1h]))
    /
    sum(rate(pulp_api_request_duration_milliseconds_bucket{le="+Inf",exported_job="pulp-api",http_method=~"PUT|POST|PATCH",http_target!~".*/upload/.*"}[1h]))
    
  • Confirm no per-series ratios return +Inf:
    rate(pulp_api_request_duration_milliseconds_bucket{le="1000",...}[1h])
    / ignoring(le)
    rate(pulp_api_request_duration_milliseconds_bucket{le="+Inf",...}[1h])
    
  • Confirm raw counter sums remain monotonically increasing
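The last bullet can be checked directly with PromQL's resets(), which counts per-series decreases within the window. After the fix, a sketch like this should return no series (legitimate collector restarts would also surface here, so any hits need a second look):

```promql
resets(pulp_api_request_duration_milliseconds_bucket{exported_job="pulp-api"}[1h]) > 0
```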

🤖 Generated with Claude Code


sourcery-ai Bot commented Apr 20, 2026

Reviewer's Guide

Adjusts the OTel metrics aggregation pipeline to convert per-worker cumulative counters to deltas before aggregation, and also adds a full dev-container image, a CI workflow, and Alcove automation for dependency upgrades and local development.

Sequence diagram for updated OTel metrics aggregation pipeline

sequenceDiagram
    participant GunicornWorker as Gunicorn_worker
    participant OTelReceiver as OTel_metrics_receiver
    participant CumulativeToDelta as cumulativetodelta
    participant RemoveWorkerName as attributes_remove_worker_name
    participant Batch as batch_api_aggregation
    participant GroupByAttrs as groupbyattrs_api_aggregation
    participant DeltaToCumulative as deltatorumulative
    participant Prometheus as prometheus_exporter

    GunicornWorker->>OTelReceiver: emit api.request_duration (cumulative, per worker)
    OTelReceiver->>CumulativeToDelta: api.request_duration (cumulative)
    CumulativeToDelta-->>OTelReceiver: api.request_duration (delta, per worker)

    OTelReceiver->>RemoveWorkerName: api.request_duration (delta, per worker)
    RemoveWorkerName-->>Batch: api.request_duration (delta, worker.name removed)
    Batch-->>GroupByAttrs: api.request_duration (delta, grouped by api attrs)

    GroupByAttrs-->>DeltaToCumulative: api.request_duration (aggregated delta)
    DeltaToCumulative-->>Prometheus: api.request_duration (aggregated cumulative)

    Note over GunicornWorker,CumulativeToDelta: Worker recycle resets its local counter to 0
    Note over CumulativeToDelta,GroupByAttrs: Reset becomes 0-delta, preventing negative aggregate
    Note over DeltaToCumulative,Prometheus: Prometheus sees a clean cumulative counter stream

File-Level Changes

Change Details Files
Fix histogram aggregation under Gunicorn worker recycling by changing the OTel metrics pipeline to aggregate deltas instead of cumulative counters and re-expose the result as cumulative to Prometheus.
  • Define a cumulativetodelta processor that targets the api.request_duration metric with strict matching.
  • Insert cumulativetodelta into the metrics/aggregation pipeline before removing worker.name to convert per-worker cumulative counters to deltas.
  • Insert deltatorumulative after groupbyattrs/api_aggregation to turn the aggregated delta stream back into a cumulative counter for export.
  • Ensure the modified pipeline still uses the existing memory_limiter, filter, batch, and groupby processors in the correct order.
deploy/clowdapp.yaml
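Assembled from the bullets above, the collector configuration would look roughly like this (an illustration, not the PR's literal diff; note that the contrib processor is named `deltatocumulative`, and Sourcery's Comment 1 below flags the PR's `deltatorumulative` spelling as a likely typo):

```yaml
processors:
  cumulativetodelta:
    include:
      metrics:
        - api.request_duration
      match_type: strict
  # contrib processor name; the PR's "deltatorumulative" spelling
  # would fail to instantiate
  deltatocumulative: {}

service:
  pipelines:
    metrics/aggregation:
      processors:
        - memory_limiter
        - filter
        - cumulativetodelta
        - attributes/remove_worker_name
        - batch/api_aggregation
        - groupbyattrs/api_aggregation
        - deltatocumulative
```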
Introduce a dedicated dev container image to run pulp-service and its dependencies locally with supervisord-managed services.
  • Create a dev-container Dockerfile that installs Python, PostgreSQL 16, Redis, pulp-service, and support tooling into a venv, applies bundled patches, and configures a non-production Pulp setup.
  • Initialize and configure PostgreSQL and Redis inside the image, including trust-based pg_hba.conf, database/user creation, and data directories.
  • Wire in existing image assets (scripts, middleware, route helpers, patches) and collect static assets as the pulp user.
  • Expose Pulp API and content ports and declare /workspace as a shared volume for editable checkouts.
dev-container/Dockerfile
Add a supervised runtime entrypoint that bootstraps the dev environment and manages core services.
  • Implement an entrypoint script that starts PostgreSQL and Redis, creates the pulp database/user if missing, and runs Django migrations.
  • Optionally install pulp-service from /workspace/pulp-service in editable mode when present to support live code reloading.
  • Reset the admin password to a configurable default and then stop the ad-hoc DB/Redis processes so supervisord can take over.
  • Start supervisord using a custom config that manages PostgreSQL, Redis, and the Pulp API/content/worker processes with logs under /var/log/pulp.
dev-container/entrypoint.sh
dev-container/supervisord.conf
Document and automate usage of the dev container and add an Alcove-based dependency upgrade workflow.
  • Add .CLAUDE.md describing how to run and use the hosted-pulp-dev-env image, manage services, apply patches, run tests, and interact with the database.
  • Define an Alcove agent (upgrade-deps/AGENT.md) with a multi-phase procedure for upgrading pulpcore/plugin dependencies, handling patches, migrations, and tests, then opening PRs.
  • Add an Alcove task and workflow for scheduling and executing the dependency upgrade pipeline using the dev container image.
  • Provide a dev-focused Django settings module that connects to local PostgreSQL/Redis and enables domain support with token auth disabled.
.CLAUDE.md
.alcove/agents/upgrade-deps/AGENT.md
.alcove/tasks/upgrade-deps.yml
.alcove/workflows/upgrade-deps-pipeline.yml
dev-container/settings.py
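A rough idea of what such a dev-only settings module could contain; every value here is an assumption for illustration, not the contents of the actual dev-container/settings.py:

```python
# Hypothetical dev-only Django settings for Pulp: local PostgreSQL/Redis,
# multi-domain support on, container token auth off. Not the PR's file.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "pulp",
        "USER": "pulp",
        "HOST": "localhost",
        "PORT": 5432,
    }
}
REDIS_HOST = "localhost"
REDIS_PORT = 6379
DOMAIN_ENABLED = True       # pulpcore multi-domain support
TOKEN_AUTH_DISABLED = True  # pulp_container token auth disabled for local use
```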
Set up CI to build and publish the dev container image to GitHub Container Registry on each branch push.
  • Add a GitHub Actions workflow that builds dev-container/Dockerfile using buildx on every branch push.
  • Sanitize the branch name by replacing slashes with dashes and use it as the Docker tag.
  • Log in to GHCR with GITHUB_TOKEN and push images to ghcr.io/<owner>/hosted-pulp-dev-env:<tag>.
  • Enable GHA cache for faster subsequent builds.
.github/workflows/build-dev-container.yml
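The described workflow could be sketched as follows (step names and action versions are assumptions; the real file is .github/workflows/build-dev-container.yml):

```yaml
name: build-dev-container
on:
  push:
    branches: ["**"]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - name: Derive Docker tag from branch name
        run: echo "TAG=${GITHUB_REF_NAME//\//-}" >> "$GITHUB_ENV"
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          file: dev-container/Dockerfile
          push: true
          tags: ghcr.io/${{ github.repository_owner }}/hosted-pulp-dev-env:${{ env.TAG }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
```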
Integrate helper scripts for patch management, service restart, and test execution in the dev container.
  • Introduce placeholder or new scripts under dev-container/scripts for adding/removing patches, applying all patches, restarting Pulp services, and running the test suite.
  • Ensure these scripts are installed into /usr/local/bin via the Dockerfile and marked executable for use inside the container.
dev-container/scripts/pulp-add-patch
dev-container/scripts/pulp-apply-all-patches
dev-container/scripts/pulp-remove-patch
dev-container/scripts/pulp-restart
dev-container/scripts/pulp-test


@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In the metrics/aggregation pipeline, deltatorumulative is currently unscoped and will affect all metrics; consider adding an include filter similar to cumulativetodelta so only api.request_duration is converted back to cumulative.
  • The dev container Dockerfile applies patches but only logs a warning on failure; if these patches are required for correctness, it would be safer to fail the build (or gate it behind an explicit opt-out) instead of proceeding with a partially patched environment.
  • In dev-container/entrypoint.sh, the pip install -e /workspace/pulp-service/pulp_service || true hides installation errors for the editable checkout; consider surfacing or failing on these errors so a misconfigured or missing workspace repo doesn’t silently degrade the dev environment.
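A minimal sketch of the change the last bullet suggests, using the path from the review (hypothetical structure; the real script is dev-container/entrypoint.sh):

```shell
#!/bin/sh
set -eu
# Hypothetical guard: only attempt the editable install when the checkout
# exists, and fail loudly instead of masking errors with `|| true`.
SRC=/workspace/pulp-service/pulp_service
if [ -d "$SRC" ]; then
    pip install -e "$SRC" || {
        echo "editable install of $SRC failed" >&2
        exit 1
    }
    echo "installed editable checkout from $SRC"
else
    echo "no editable checkout at $SRC; using the image's baked-in install"
fi
```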
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the `metrics/aggregation` pipeline, `deltatorumulative` is currently unscoped and will affect all metrics; consider adding an `include` filter similar to `cumulativetodelta` so only `api.request_duration` is converted back to cumulative.
- The dev container Dockerfile applies patches but only logs a warning on failure; if these patches are required for correctness, it would be safer to fail the build (or gate it behind an explicit opt-out) instead of proceeding with a partially patched environment.
- In `dev-container/entrypoint.sh`, the `pip install -e /workspace/pulp-service/pulp_service || true` hides installation errors for the editable checkout; consider surfacing or failing on these errors so a misconfigured or missing workspace repo doesn’t silently degrade the dev environment.

## Individual Comments

### Comment 1
<location path="deploy/clowdapp.yaml" line_range="52-56" />
<code_context>
+              - api.request_duration
+            match_type: strict
+
+        deltatorumulative:
+
         batch:
</code_context>
<issue_to_address>
**issue (bug_risk):** Processor name `deltatorumulative` looks like a typo and may not match the actual OpenTelemetry processor name.

This name doesn’t match the usual OTEL `deltatocumulative` processor and may prevent the processor from being instantiated, breaking delta→cumulative conversion for these metrics. Please verify the intended processor and update both its definition and references to use the correct name.
</issue_to_address>


Comment thread on deploy/clowdapp.yaml (outdated)
@dkliban force-pushed the fix-otel-histogram-counter-resets branch 2 times, most recently from 5c996e0 to d57cf4c on April 20, 2026 at 16:38

dkliban commented Apr 20, 2026

/retest

1 similar comment

dkliban commented Apr 20, 2026

/retest

…egation

Gunicorn worker recycling causes in-memory Prometheus counters to reset.
The OTel aggregation pipeline strips worker.name and sums all workers
into a single cumulative counter via groupbyattrs. When a worker recycles,
its counter resets to 0, decreasing the aggregate.

This manifests as a "hidden counter reset" in Prometheus: if the recycled
worker's final le=+Inf value coincidentally equals the new worker's
starting value (e.g. both are 1 because the new worker immediately handled
a slow request), Prometheus does not detect the reset for le=+Inf. But
le=1000 resets visibly. This inflates rate(le=1000) relative to
rate(le=+Inf), producing SLI ratios greater than 1.

Fix: insert cumulativetodelta before worker aggregation so we sum
per-worker deltas (always non-negative) instead of cumulative totals.
Worker recycles produce a 0-delta rather than a negative value that
corrupts the aggregate. Add deltatorumulative after groupbyattrs to
convert the aggregate delta back to a cumulative counter for the
Prometheus exporter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dkliban force-pushed the fix-otel-histogram-counter-resets branch from d57cf4c to 7e1f670 on April 21, 2026 at 01:13