Harden Redis connection lifecycle to prevent Render crash loops#45
Conversation
✅ Deploy Preview for pagepersonai ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: efe663d61b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (!client) { | ||
| throw new Error("Redis unavailable: sendCommand failed in degraded mode"); | ||
| } |
There was a problem hiding this comment.
Avoid throwing in degraded Redis sendCommand path
redisClient.sendCommand now throws whenever ensureRedisConnected() returns null (disabled Redis or initial connect failure), but the rate limiter always calls this method when REDIS_URL is set. In those outage/disabled scenarios, each rate-limit check becomes an exception instead of a graceful degraded-mode fallback, which can break request handling paths that depend on the limiter middleware.
Useful? React with 👍 / 👎.
| if (!connectPromise) { | ||
| connectPromise = baseRedisClient | ||
| .connect() | ||
| .then((client) => { |
There was a problem hiding this comment.
Add backoff guard after failed explicit connect attempts
After a failed baseRedisClient.connect(), connectPromise is always cleared in finally, so every subsequent Redis operation immediately starts a brand-new connect attempt. During a Redis outage, request-driven cache/rate-limit calls can repeatedly trigger fresh connects and log spam, effectively bypassing the intended bounded retry behavior in reconnectStrategy.
Useful? React with 👍 / 👎.
Motivation
@redis/clientsocket errors, which bubbled to an unhandled EventEmittererrorand terminated Node.Description
server/src/utils/redis-client.tswith socket tuning (connectTimeout,keepAlive,keepAliveInitialDelay) and a boundedreconnectStrategyto avoid aggressive/unbounded retries.error,ready,end,reconnecting) and explicit logging via the app logger to prevent unhandled emitter errors and improve diagnostics.connectPromiseand methods that operate in degraded mode (returningnull/0where appropriate) when Redis is disabled/unavailable.REDIS_URLvalidation and sanitized warnings about TLS mismatches (recommendrediss://for managed providers) without logging secrets.getStatus()on the facade and surfaced Redis status in the health endpoint atserver/src/routes/monitor-route.ts(addedredis.configuredandredis.status).unhandledRejectionanduncaughtExceptioninserver/src/index.tsto avoid silent crash loops while still surfacing errors.server/src/config/rateLimiter.tssorate-limit-rediscan callsendCommandon the facade.REDIS_URLTLS guidance and that the app boots in degraded mode if Redis is missing.server/src/utils/__tests__/redis-client.test.tsto validate lifecycle listener registration and degraded-mode behavior.Modified files (key):
server/src/utils/redis-client.ts(main changes)server/src/routes/monitor-route.ts(health status)server/src/index.ts(process-level handlers)server/src/config/rateLimiter.ts(sendCommand typing bridge)server/src/utils/__tests__/redis-client.test.ts(tests)README.md(env guidance)Testing
npm run test --workspace=server -- src/utils/__tests__/redis-client.test.tsand all tests passed (5/5).npm run build --workspace=serverfailed due to an existing workspace TypeScript config/include issue around@pagepersonai/shared(this is unrelated to the Redis changes and pre-existed the patch).npm run start --workspace=serverfailed in this environment due to a missing@pagepersonai/shared/dist/index.jsruntime artifact (also unrelated to the Redis changes).Notes: the code now intentionally boots in degraded mode when
REDIS_URLis not set or initial connection fails, avoiding process exits on Redis outages while logging explicit transition events; if a strictly-fail-fast behavior is required instead, that can be toggled by policy or environment variable.Codex Task