Skip to content

Harden Redis connection lifecycle to prevent Render crash loops#45

Merged
sjkd23 merged 1 commit into
mainfrom
codex/fix-redis-client-crash-in-production
Feb 5, 2026
Merged

Harden Redis connection lifecycle to prevent Render crash loops#45
sjkd23 merged 1 commit into
mainfrom
codex/fix-redis-client-crash-in-production

Conversation

@sjkd23
Copy link
Copy Markdown
Owner

@sjkd23 sjkd23 commented Feb 5, 2026

Motivation

  • Production processes were crashing from unhandled @redis/client socket errors, which bubbled to an unhandled EventEmitter error and terminated Node.
  • The Redis client needed resilience (single shared client, guarded connect, reconnect/backoff, socket options) and TLS/URL validation to avoid misconfigured managed providers.
  • Health / observability and minimal process-level safety nets were required so outages don’t produce silent crash loops.

Description

  • Reworked the shared Redis module into a resilient singleton facade at server/src/utils/redis-client.ts with socket tuning (connectTimeout, keepAlive, keepAliveInitialDelay) and a bounded reconnectStrategy to avoid aggressive/unbounded retries.
  • Added mandatory lifecycle listeners on the raw client (error, ready, end, reconnecting) and explicit logging via the app logger to prevent unhandled emitter errors and improve diagnostics.
  • Ensured single guarded connection attempts with a shared connectPromise and methods that operate in degraded mode (returning null/0 where appropriate) when Redis is disabled/unavailable.
  • Added safe REDIS_URL validation and sanitized warnings about TLS mismatches (recommend rediss:// for managed providers) without logging secrets.
  • Exposed getStatus() on the facade and surfaced Redis status in the health endpoint at server/src/routes/monitor-route.ts (added redis.configured and redis.status).
  • Added process-level logging handlers for unhandledRejection and uncaughtException in server/src/index.ts to avoid silent crash loops while still surfacing errors.
  • Minor typing/bridge in server/src/config/rateLimiter.ts so rate-limit-redis can call sendCommand on the facade.
  • Updated README to document REDIS_URL TLS guidance and that the app boots in degraded mode if Redis is missing.
  • Added/updated unit tests at server/src/utils/__tests__/redis-client.test.ts to validate lifecycle listener registration and degraded-mode behavior.

Modified files (key):

  • server/src/utils/redis-client.ts (main changes)
  • server/src/routes/monitor-route.ts (health status)
  • server/src/index.ts (process-level handlers)
  • server/src/config/rateLimiter.ts (sendCommand typing bridge)
  • server/src/utils/__tests__/redis-client.test.ts (tests)
  • README.md (env guidance)

Testing

  • Ran unit tests for the Redis client: npm run test --workspace=server -- src/utils/__tests__/redis-client.test.ts and all tests passed (5/5).
  • Attempted full TypeScript build: npm run build --workspace=server failed due to an existing workspace TypeScript config/include issue around @pagepersonai/shared (this is unrelated to the Redis changes and pre-existed the patch).
  • Attempted a local start: npm run start --workspace=server failed in this environment due to a missing @pagepersonai/shared/dist/index.js runtime artifact (also unrelated to the Redis changes).

Notes: the code now intentionally boots in degraded mode when REDIS_URL is not set or initial connection fails, avoiding process exits on Redis outages while logging explicit transition events; if a strictly-fail-fast behavior is required instead, that can be toggled by policy or environment variable.


Codex Task

@netlify
Copy link
Copy Markdown

netlify Bot commented Feb 5, 2026

Deploy Preview for pagepersonai ready!

Name Link
🔨 Latest commit efe663d
🔍 Latest deploy log https://app.netlify.com/projects/pagepersonai/deploys/6984af1cb08e0f0008bc9fef
😎 Deploy Preview https://deploy-preview-45--pagepersonai.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@sjkd23 sjkd23 merged commit c4c590e into main Feb 5, 2026
5 of 8 checks passed
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: efe663d61b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +180 to +182
if (!client) {
throw new Error("Redis unavailable: sendCommand failed in degraded mode");
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid throwing in degraded Redis sendCommand path

redisClient.sendCommand now throws whenever ensureRedisConnected() returns null (disabled Redis or initial connect failure), but the rate limiter always calls this method when REDIS_URL is set. In those outage/disabled scenarios, each rate-limit check becomes an exception instead of a graceful degraded-mode fallback, which can break request handling paths that depend on the limiter middleware.

Useful? React with 👍 / 👎.

Comment on lines +120 to +123
if (!connectPromise) {
connectPromise = baseRedisClient
.connect()
.then((client) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add backoff guard after failed explicit connect attempts

After a failed baseRedisClient.connect(), connectPromise is always cleared in finally, so every subsequent Redis operation immediately starts a brand-new connect attempt. During a Redis outage, request-driven cache/rate-limit calls can repeatedly trigger fresh connects and log spam, effectively bypassing the intended bounded retry behavior in reconnectStrategy.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant