Entitle experienced a major incident on August 25, 2025, affecting Access Change Requests and lasting 2h 23m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Aug 25, 2025, 08:32 AM UTC
We are currently investigating this issue.
- identified Aug 25, 2025, 09:18 AM UTC
The issue has been identified and a fix is being implemented.
- monitoring Aug 25, 2025, 09:55 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved Aug 25, 2025, 10:55 AM UTC
This incident has been resolved.
- postmortem Aug 28, 2025, 07:26 PM UTC
# Service Degradation and Partial Downtime on August 25, 2025

## Incident Overview

On **August 25, 2025**, Entitle services experienced a major service disruption between **02:00 and 15:00 UTC**. During this period, customers encountered **failures in JIT access request processing** and **reduced availability of the web app frontend**.

**Impact:** All customers were affected.

**Duration:** ~14 hours of instability.

## Root Cause

The disruption resulted from a **combination of two factors**:

1. **A new version of the queue mechanism**, introduced in the latest release, was not fully resilient under failure conditions.
2. **Database resource utilization and configuration issues** led to instability and system restarts.

Together, these factors created cascading failures that degraded core services.

## Response

* The Entitle engineering team mobilized and worked continuously to investigate and mitigate the issue.
* Temporary fixes improved stability, but a **full rollback to the prior version at 15:00 UTC** was ultimately required to restore the full service SLA.
* Customers were kept informed through status page updates.

## Business Impact

* Service availability was significantly reduced during the incident.
* Delays in JIT access request processing impacted customers' operations.

## Preventive Actions

We are implementing the following improvements to prevent recurrence:

* **Safer Releases:** Automated rollback validation in every release cycle.
* **Better Monitoring:** Improved real-time alerts for JIT access request backlogs and service degradation.
* **Operational Resilience:** Expanded tests for third-party dependencies.

## Closing Note

We sincerely apologize for this disruption. We recognize the trust you place in us and are committed to learning from this incident. The actions outlined above are already in progress to ensure stronger reliability and resilience moving forward.