
fix: improve performance, upgraded otel fastify#902

Merged
fenos merged 1 commit into master from perf/otel-upgrade-reduce-allocations
Mar 11, 2026

Conversation

@fenos
Contributor

@fenos fenos commented Mar 10, 2026

What kind of change does this PR introduce?

Improvement

What is the new behavior?

  • Upgraded the deprecated otel library @opentelemetry/instrumentation-fastify to https://www.npmjs.com/package/@fastify/otel
  • Removed duplicate metrics
  • Reduced allocations and improved performance of the log filters
  • Used batch observation to monitor pool metrics
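The batched pool-metrics idea in the last bullet can be sketched as follows. This is a minimal illustration, not the actual implementation: `PoolStats`, `getPoolStats()`, and the `observe` callback are simplified stand-ins for the real types in `src/internal/database/pool.ts` and `src/internal/monitoring/metrics.ts`.

```typescript
// Hypothetical sketch: one batch callback walks every cached tenant pool and
// records all gauge values in a single pass, instead of running one interval
// timer per pool. Shapes below are simplified stand-ins.

interface PoolStats {
  total: number
  used: number
}

interface TenantPoolLike {
  getPoolStats(): PoolStats
}

function collectPoolStats(
  tenantPools: Map<string, TenantPoolLike>,
  observe: (name: string, value: number, attrs: Record<string, string>) => void
): void {
  for (const [tenantId, pool] of tenantPools) {
    const stats = pool.getPoolStats()
    observe('db_connections', stats.total, { tenantId })
    observe('db_connections_in_use', stats.used, { tenantId })
  }
}

// Usage with a stub pool; in the real code the observe callback would be an
// OTEL batch-observable callback recording into observable gauges.
const pools = new Map<string, TenantPoolLike>([
  ['tenant-a', { getPoolStats: () => ({ total: 10, used: 3 }) }],
])
const observed: Array<[string, number]> = []
collectPoolStats(pools, (name, value) => observed.push([name, value]))
```

Because the collection happens in one callback, a slow `getPoolStats()` never blocks the request path, only the metrics scrape.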

@fenos fenos requested a review from a team as a code owner March 10, 2026 10:57
@coderabbitai

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced observability with batch-based database connection pool metrics collection (non-blocking).
    • New Grafana dashboard visualizations including Node.js Runtime, memory, and file descriptor metrics.
  • Improvements

    • Upgraded OpenTelemetry Fastify instrumentation for better trace span attribute collection.
    • Optimized query parameter redaction and header filtering for improved performance.
  • Chores

    • Updated OpenTelemetry dependencies and consolidated metrics registration patterns.

Walkthrough

Reworks tracing and metrics integration: replaces @opentelemetry/instrumentation-fastify with @fastify/otel (package.json updated) and uses req.opentelemetry()?.span in request hooks to set span attributes (tenant.ref, trace.mode). Metrics registration centralized via a new registerMetric and a public meter; pool metrics switched from per-pool interval monitors to a batched, non-blocking collector using PoolStats, TenantPool.getPoolStats() and PoolStrategy.getPoolStats(). Header allowlisting and query-param redaction logic simplified. Grafana dashboard JSON reorganized.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Fastify as Fastify Server
    participant FastifyOtel as @fastify/otel
    participant AppHooks as App Request Hooks
    participant ReqOp as req.opentelemetry()
    participant TenantPool as Tenant DB Pool
    participant PoolManager as PoolManager Collector
    participant Meter as OTEL Meter
    participant BatchObs as BatchObservable Callback
    participant Prometheus as Metrics Exporter

    Client->>Fastify: HTTP request
    Fastify->>FastifyOtel: auto-instrumentation captures request
    FastifyOtel->>AppHooks: invoke app hooks for request
    AppHooks->>ReqOp: req.opentelemetry() -> span
    ReqOp-->>AppHooks: span available (optional)
    AppHooks->>ReqOp: set span attributes (tenant.ref, trace.mode)
    AppHooks->>TenantPool: acquire DB connection
    TenantPool-->>AppHooks: connection returned
    PoolManager->>BatchObs: schedule periodic batch collection
    BatchObs->>TenantPool: call getPoolStats() (chunked)
    TenantPool-->>BatchObs: PoolStats (used, total)
    BatchObs->>Meter: record observable gauge values
    Meter->>Prometheus: expose metrics for scraping
```


@fenos fenos force-pushed the perf/otel-upgrade-reduce-allocations branch from 93c10aa to e49ff7a on March 10, 2026 11:03
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/internal/monitoring/metrics.ts`:
- Around line 222-248: registerObservableGauge currently creates observable
instruments but never checks the entry.enabled flag when emitting values, so
dbActivePool, dbActiveConnection, and dbInUseConnection are always observed;
modify registerObservableGauge to register the observable with a callback that
first checks the created entry.enabled (or disabledMetrics) and only records
values when enabled, leaving metricsRegistry.set(name, entry) intact and
ensuring the same entry object (used by setMetricsEnabled) controls observation;
reference registerObservableGauge, the returned entry,
db_active_local_pools/db_connections/db_connections_in_use,
dbActivePool/dbActiveConnection/dbInUseConnection, metricsRegistry, and
disabledMetrics when implementing the check.
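The gating described in the comment above can be sketched like this. It is a minimal illustration under stated assumptions: the entry shape and the names `registerObservableGauge` and `metricsRegistry` follow the review comment, while the meter wiring is omitted.

```typescript
// Hypothetical sketch: the observable callback consults the same entry object
// that setMetricsEnabled mutates, so flipping the flag later suppresses
// observation without re-registering the instrument.

interface MetricEntry {
  enabled: boolean
}

const metricsRegistry = new Map<string, MetricEntry>()

function registerObservableGauge(
  name: string,
  collect: () => number,
  observe: (value: number) => void
): { entry: MetricEntry; callback: () => void } {
  const entry: MetricEntry = { enabled: true }
  metricsRegistry.set(name, entry)
  const callback = () => {
    // Only record while the entry is enabled.
    if (!entry.enabled) return
    observe(collect())
  }
  // In the real code this callback would be handed to the meter's
  // batch-observable registration.
  return { entry, callback }
}

// Usage with a stub observer:
const values: number[] = []
const { entry, callback } = registerObservableGauge(
  'db_active_local_pools',
  () => 5,
  (v) => values.push(v)
)
callback()            // records while enabled
entry.enabled = false
callback()            // suppressed
```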

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: e58a638f-ce06-4c65-846e-9c13c11016f0

📥 Commits

Reviewing files that changed from the base of the PR and between 93c10aa and e49ff7a.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (7)
  • package.json
  • src/http/plugins/tracing.ts
  • src/internal/database/pool.ts
  • src/internal/monitoring/logger.ts
  • src/internal/monitoring/metrics.ts
  • src/internal/monitoring/otel-metrics.ts
  • src/internal/monitoring/otel-tracing.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/http/plugins/tracing.ts

@fenos fenos force-pushed the perf/otel-upgrade-reduce-allocations branch from e49ff7a to 3bbd95c on March 11, 2026 08:37
@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
monitoring/grafana/dashboards/storage-otel.json (1)

1007-1014: ⚠️ Potential issue | 🟠 Major

This dashboard will import with blank data.

Line 5743 and Line 5744 pin the default time range to a 55-second window on March 10, 2026, so a fresh import opens on stale/no data. Several new PromQL selectors also no longer match the current metrics surface: status_code=~"2xx"/"4xx"/"5xx" conflicts with the numeric status selectors already used elsewhere in this file, and storage_api_otel_db_pool no longer matches the renamed db_active_local_pools instrument in src/internal/monitoring/metrics.ts. Those panels will stay empty until the queries and default time range are updated.

Also applies to: 1194-1229, 1436-1445, 1827-1958, 2929-2937, 5742-5744

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@monitoring/grafana/dashboards/storage-otel.json` around lines 1007 - 1014,
The dashboard panels import empty because a fixed dashboard time range is pinned
to a 55s window (remove or replace the dashboard "time" object so Grafana uses a
relative or default range like "now-6h"/"now" or $__from/$__to) and several
PromQL queries reference a renamed metric and incompatible status label
selectors; update any panel "expr" that uses storage_api_otel_db_pool to use the
renamed metric db_active_local_pools, and replace status_code=~"2xx"/"4xx"/"5xx"
with the numeric-style selectors used elsewhere (e.g., a regex like
status_code=~"2[0-9][0-9]" / "4[0-9][0-9]" / "5[0-9][0-9]" or the label name
used across this file), and apply these changes to all affected panels (those
using the "expr" field shown and the other ranges noted) so panels show current
data after import.
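A hedged sketch of the two fixes as Grafana JSON fragments, with a relative default time range and an expression using the renamed gauge (both values taken from the comment above; surrounding dashboard keys and panel fields omitted):

```json
{
  "time": { "from": "now-6h", "to": "now" },
  "panels": [
    {
      "targets": [
        { "expr": "sum(db_active_local_pools)", "refId": "A" }
      ]
    }
  ]
}
```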
♻️ Duplicate comments (2)
src/http/plugins/tracing.ts (1)

31-38: ⚠️ Potential issue | 🟠 Major

Set these on the request span, not the active hook span.

@fastify/otel exposes the per-request span on req.opentelemetry().span and documents requestHook as the callback that runs immediately after creating that root request span; lifecycle hooks get separate hook spans. Using trace.getActiveSpan() inside onRequest is therefore the wrong handle for request-level attrs here, so tenant.ref / trace.mode can land on the hook span instead of the request span. Use the request span handle here, or move the synchronous part into requestHook. (github.com)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/http/plugins/tracing.ts` around lines 31 - 38, The attributes are being
set on the active hook span via trace.getActiveSpan() instead of the request
root span; change the code in the onRequest handler to retrieve the per-request
span from request.opentelemetry().span (or move this logic into the requestHook)
and call setAttribute('tenant.ref', request.tenantId) and
setAttribute('trace.mode', request.tracingMode) on that request span
(referencing the existing span variable usage and trace.getActiveSpan() in the
current diff) so tenant.ref / trace.mode land on the request span rather than
the hook span.
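The suggested fix can be sketched as below, assuming simplified stand-ins for Fastify's request type and the OTEL span; the point is reading the root request span from `req.opentelemetry().span` instead of `trace.getActiveSpan()`.

```typescript
// Hypothetical sketch of the onRequest handler. @fastify/otel attaches the
// root request span before app hooks run, so this handle targets the request
// span rather than a per-hook child span.

interface SpanLike {
  setAttribute(key: string, value: string): void
}

interface RequestLike {
  tenantId: string
  tracingMode: string
  opentelemetry(): { span?: SpanLike }
}

function onRequestHook(request: RequestLike): void {
  const span = request.opentelemetry().span
  if (!span) return
  span.setAttribute('tenant.ref', request.tenantId)
  span.setAttribute('trace.mode', request.tracingMode)
}

// Usage with a stub request that records the attributes set on its span:
const attrs: Record<string, string> = {}
onRequestHook({
  tenantId: 'tenant-a',
  tracingMode: 'full',
  opentelemetry: () => ({
    span: {
      setAttribute: (k, v) => {
        attrs[k] = v
      },
    },
  }),
})
```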
src/internal/monitoring/metrics.ts (1)

55-75: ⚠️ Potential issue | 🟠 Major

Metric disable controls are now effectively dead.

registerMetric() records enabled, but nothing consults it before emitting. After this refactor, METRICS_DISABLED and setMetricsEnabled() stop suppressing the synchronous instruments here, and the observable pool gauges still have no path to honor the flag.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/internal/monitoring/metrics.ts` around lines 55 - 75, registerMetric
currently records enabled in metricsRegistry but never uses it; update
registerMetric to consult the enabled flag (derived from disabledMetrics /
setMetricsEnabled / METRICS_DISABLED) and make synchronous instruments respect
it by returning a no-op instrument or wrapping inst.record and inst.add to
early-return when !enabled; keep existing tenant-attr stripping
(prometheusMetricsIncludeTenantId & stripTenantAttrs) but ensure the wrapper
checks metricsRegistry.get(name).enabled before calling original, and likewise
ensure any observable pool gauges/observable callbacks check the same enabled
flag before emitting/observing so disabled metrics are actually suppressed.
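The no-op wrapping suggested above can be sketched as follows; the names (`registerMetric`, `metricsRegistry`) follow the review comment, while the counter instrument shape is a simplified stand-in, not the real OTEL API.

```typescript
// Hypothetical sketch: the returned instrument re-checks the live registry
// entry on every call, so a later setMetricsEnabled(name, false) suppresses
// emission immediately without re-creating the instrument.

interface CounterLike {
  add(value: number): void
}

interface RegistryEntry {
  enabled: boolean
}

const metricsRegistry = new Map<string, RegistryEntry>()

function registerMetric(name: string, inst: CounterLike, enabled = true): CounterLike {
  const entry: RegistryEntry = { enabled }
  metricsRegistry.set(name, entry)
  return {
    add(value: number) {
      // Early-return while the metric is disabled.
      if (!metricsRegistry.get(name)?.enabled) return
      inst.add(value)
    },
  }
}

// Usage with a stub instrument:
let total = 0
const counter = registerMetric('http_requests_total', {
  add: (v) => {
    total += v
  },
})
counter.add(1) // recorded while enabled
metricsRegistry.get('http_requests_total')!.enabled = false
counter.add(1) // suppressed
```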
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@monitoring/grafana/dashboards/storage-otel.json`:
- Around line 1007-1014: The dashboard panels import empty because a fixed
dashboard time range is pinned to a 55s window (remove or replace the dashboard
"time" object so Grafana uses a relative or default range like "now-6h"/"now" or
$__from/$__to) and several PromQL queries reference a renamed metric and
incompatible status label selectors; update any panel "expr" that uses
storage_api_otel_db_pool to use the renamed metric db_active_local_pools, and
replace status_code=~"2xx"/"4xx"/"5xx" with the numeric-style selectors used
elsewhere (e.g., a regex like status_code=~"2[0-9][0-9]" / "4[0-9][0-9]" /
"5[0-9][0-9]" or the label name used across this file), and apply these changes
to all affected panels (those using the "expr" field shown and the other ranges
noted) so panels show current data after import.

---

Duplicate comments:
In `@src/http/plugins/tracing.ts`:
- Around line 31-38: The attributes are being set on the active hook span via
trace.getActiveSpan() instead of the request root span; change the code in the
onRequest handler to retrieve the per-request span from
request.opentelemetry().span (or move this logic into the requestHook) and call
setAttribute('tenant.ref', request.tenantId) and setAttribute('trace.mode',
request.tracingMode) on that request span (referencing the existing span
variable usage and trace.getActiveSpan() in the current diff) so tenant.ref /
trace.mode land on the request span rather than the hook span.

In `@src/internal/monitoring/metrics.ts`:
- Around line 55-75: registerMetric currently records enabled in metricsRegistry
but never uses it; update registerMetric to consult the enabled flag (derived
from disabledMetrics / setMetricsEnabled / METRICS_DISABLED) and make
synchronous instruments respect it by returning a no-op instrument or wrapping
inst.record and inst.add to early-return when !enabled; keep existing
tenant-attr stripping (prometheusMetricsIncludeTenantId & stripTenantAttrs) but
ensure the wrapper checks metricsRegistry.get(name).enabled before calling
original, and likewise ensure any observable pool gauges/observable callbacks
check the same enabled flag before emitting/observing so disabled metrics are
actually suppressed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 7ce40c1f-dd40-4cd6-8e26-77071e818aa5

📥 Commits

Reviewing files that changed from the base of the PR and between e49ff7a and 3bbd95c.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (10)
  • monitoring/grafana/dashboards/storage-otel.json
  • package.json
  • src/http/plugins/metrics.ts
  • src/http/plugins/tracing.ts
  • src/http/routes/tus/lifecycle.ts
  • src/internal/database/pool.ts
  • src/internal/monitoring/logger.ts
  • src/internal/monitoring/metrics.ts
  • src/internal/monitoring/otel-metrics.ts
  • src/internal/monitoring/otel-tracing.ts
💤 Files with no reviewable changes (2)
  • src/http/plugins/metrics.ts
  • src/http/routes/tus/lifecycle.ts

@fenos fenos force-pushed the perf/otel-upgrade-reduce-allocations branch 2 times, most recently from ad725b7 to 1d97551 on March 11, 2026 09:39
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@monitoring/grafana/dashboards/storage-otel.json`:
- Around line 5593-5594: The dashboard JSON currently hard-codes a fixed time
window using the "from" and "to" properties (e.g., "from":
"2026-03-10T12:33:29.823Z", "to": "2026-03-10T12:34:24.902Z"); change these to
relative time strings (for example "from": "now-5m", "to": "now") so the
dashboard opens with a live range. Locate and update the "from" and "to" keys in
the dashboard root/timeRange object (and any other occurrences of fixed ISO
timestamps) to use relative expressions like now-5m/now to ensure fresh imports
open with current data.
- Around line 1165-1201: The panels are querying the wrong series name and using
literal "2xx"/"4xx"/"5xx" label values; replace occurrences of
storage_api_otel_http_request_duration_seconds_count with
storage_api_http_request_duration_seconds_count (the true series produced by
http_request_duration_seconds) and change status_code=~"2xx"/"4xx"/"5xx" to a
proper regex that matches numeric HTTP codes such as
status_code=~"2.."/"4.."/"5.." (or more strictly status_code=~"^2\\d{2}$", etc.)
wherever the expressions reference
storage_api_otel_http_request_duration_seconds_count and status_code label
filters so the panels (refs: the expressions with refId A/B/C and other similar
blocks) will return correct results.
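As a hedged illustration of the corrected target (series name and numeric status regex both taken from the comment above; the surrounding panel object is omitted):

```json
{
  "expr": "sum by (status_code) (rate(storage_api_http_request_duration_seconds_count{status_code=~\"5[0-9][0-9]\"}[1m]))",
  "refId": "A"
}
```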

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 705c6ec1-e3ec-4d79-8833-823270043bbd

📥 Commits

Reviewing files that changed from the base of the PR and between 3bbd95c and 1d97551.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (13)
  • monitoring/grafana/dashboards/storage-otel.json
  • package.json
  • src/http/plugins/log-request.ts
  • src/http/plugins/metrics.ts
  • src/http/plugins/tracing.ts
  • src/http/routes/s3/index.ts
  • src/http/routes/tus/lifecycle.ts
  • src/internal/database/pool.ts
  • src/internal/monitoring/logger.ts
  • src/internal/monitoring/metrics.ts
  • src/internal/monitoring/otel-metrics.ts
  • src/internal/monitoring/otel-tracing.ts
  • src/start/server.ts
💤 Files with no reviewable changes (2)
  • src/http/routes/tus/lifecycle.ts
  • src/http/plugins/metrics.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • package.json

@fenos fenos enabled auto-merge (squash) March 11, 2026 12:49
collectInProgress = true

try {
  const entries = [...tenantPools.entries()]
Member

@ferhatelmas ferhatelmas Mar 11, 2026

There are some conditions below where a pool wasn't added into the cache. Missing them is intentional/insignificant because they are short-lived?

Contributor Author

what do you mean?

Member

was referring to this condition

export const httpRequestsTotal = withMetricLabels(
'http_requests_total',
'counter',
export const httpRequestsTotal = registerMetric('http_requests_total', 'counter', () =>
Member

good to drop at this point?

Comment on lines +51 to +54
function stripTenantAttrs(attrs: Attributes): Attributes {
const { tenantId, tenant_id, ...rest } = attrs as Record<string, unknown>
return rest as Attributes
}
Member

is it worth checking the config and not adding it in the first place?

Contributor Author

yes, but it would mean too many ifs in too many places

Member

yes, it would be a bit messier since it would clutter usage. We can do it as a follow-up (if needed)


const attributes = {
  method,
  route,
Member

what is the reason? cardinality + covered by operation?

sounds good, but there are still references to it in the dashboard though

Contributor Author

Yes, cardinality; we only use the operation. I don't think we reference the route

Member

The Top 10 panels refer to it

@fenos fenos force-pushed the perf/otel-upgrade-reduce-allocations branch from d696c92 to 092b090 on March 11, 2026 17:31
Member

@ferhatelmas ferhatelmas left a comment

Before merge:

  • ensure the dashboard doesn't refer to dropped metrics or labels

Later:

  • handle non-cached pool metrics
  • move tenant-label addition to the metric recording site to avoid allocations (if needed)

@fenos fenos merged commit c82f9dd into master Mar 11, 2026
3 checks passed
@fenos fenos deleted the perf/otel-upgrade-reduce-allocations branch March 11, 2026 19:49