Entitle incident

Entitle US Region downtime

Entitle experienced a major incident on June 16, 2025 affecting Entitle Portal, lasting 11m. The incident has been resolved; the full update timeline is below.

Started: Jun 16, 2025, 06:48 PM UTC
Resolved: Jun 16, 2025, 07:00 PM UTC
Duration: 11m
Detected by Pingoru: Jun 16, 2025, 06:48 PM UTC

Affected components

Entitle Portal

Update timeline

investigating Jun 16, 2025, 06:48 PM UTC

We are currently investigating this issue.

resolved Jun 16, 2025, 07:00 PM UTC

This incident has been resolved.

postmortem Jun 16, 2025, 07:56 PM UTC

## Postmortem: Redis Outage in the Entitle US Region

* **Date:** 2025-06-16
* **Service Affected:** Entitle – US Region
* **Root Cause:** Redis pods in `CrashLoopBackOff`

### Summary

On **June 16, 2025**, the Entitle US region experienced a **partial service outage** due to a failure in the **Redis** backend service. The Redis pod entered a `CrashLoopBackOff` state and failed to restart successfully. This caused cascading issues in dependent services, including elevated Redis client errors, Pub/Sub disconnections, and connection retry exhaustion. The issue was resolved by performing a **manual hard reset** of the Redis pod.

The Redis container **crashed unexpectedly**, and Kubernetes was **unable to restart it** because of repeated startup failures. This put the pod into a `BackOff` state, in which Kubernetes delays further restart attempts. The downstream Node.js services attempted to reconnect continuously, leading to:

* **Redis connection timeouts**
* **Pub/Sub disconnections** (`EOF`)
* **Memory pressure due to too many event listeners**

This condition persisted until the pod was **forcefully deleted**, which reset the backoff and allowed Redis to start cleanly.

### Action Items

* A **new monitor was created** to track `CrashLoopBackOff` and `BackOff` events in critical infrastructure pods. This will allow us to detect and respond to container restart failures earlier, and potentially prevent downtime by intervening sooner.
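The postmortem does not say which tooling the new monitor uses. As a minimal sketch of the kind of check such a monitor might perform, assuming pod objects have already been fetched from the Kubernetes API as JSON, the TypeScript function below flags containers whose waiting reason is `CrashLoopBackOff`. The names `Pod`, `RestartAlert`, `findRestartFailures`, and the sample pod names are hypothetical; only the field paths (`status.containerStatuses[].state.waiting.reason` and `restartCount`) follow the Kubernetes core/v1 Pod schema. `BackOff` is reported as an event reason rather than a container state, so covering it would also require reading Events, which this sketch does not do.

```typescript
// Minimal subset of the Kubernetes core/v1 Pod object used by this sketch.
interface ContainerStatus {
  name: string;
  restartCount: number;
  state?: { waiting?: { reason?: string; message?: string } };
}

interface Pod {
  metadata: { name: string; namespace: string };
  status?: { containerStatuses?: ContainerStatus[] };
}

// Hypothetical alert shape; not taken from the postmortem.
interface RestartAlert {
  namespace: string;
  pod: string;
  container: string;
  reason: string;
  restartCount: number;
}

// Waiting reasons treated as restart failures (an assumed list for this sketch).
const ALERT_REASONS = new Set<string>(["CrashLoopBackOff"]);

// Returns one alert per container that is waiting with an alerting reason
// and has restarted at least `minRestarts` times.
function findRestartFailures(pods: Pod[], minRestarts: number = 1): RestartAlert[] {
  const alerts: RestartAlert[] = [];
  for (const pod of pods) {
    // Pods that have not been scheduled yet may have no container statuses.
    for (const cs of pod.status?.containerStatuses ?? []) {
      const reason = cs.state?.waiting?.reason;
      if (reason !== undefined && ALERT_REASONS.has(reason) && cs.restartCount >= minRestarts) {
        alerts.push({
          namespace: pod.metadata.namespace,
          pod: pod.metadata.name,
          container: cs.name,
          reason,
          restartCount: cs.restartCount,
        });
      }
    }
  }
  return alerts;
}

// Example input with hypothetical pod names.
const examplePods: Pod[] = [
  {
    metadata: { name: "redis-0", namespace: "infra" },
    status: {
      containerStatuses: [
        { name: "redis", restartCount: 6, state: { waiting: { reason: "CrashLoopBackOff" } } },
      ],
    },
  },
  {
    metadata: { name: "portal-api-1", namespace: "app" },
    status: { containerStatuses: [{ name: "api", restartCount: 0, state: {} }] },
  },
];

const exampleAlerts = findRestartFailures(examplePods);
```

Here `exampleAlerts` contains a single entry for the `redis` container. Keeping the check as a pure function over already-fetched pod data makes it easy to test and leaves the choice of Kubernetes client, polling interval, and alert delivery open.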