Entitle experienced a notice-level incident on March 24, 2025 affecting Access Change Requests. The incident has been resolved; the full update timeline is below.
Affected components
- Access Change Requests
Update timeline
- resolved Mar 24, 2025, 09:12 PM UTC
Some customers may experience delays in the ticket granting mechanism.
- postmortem Mar 26, 2025, 02:50 PM UTC
**Incident Postmortem: Ticket Processing Partial Downtime**

**Incident Date:** Monday, March 24, 2025

**Impact:** Partial downtime affecting ticket processing

**Summary**

On Monday, we experienced partial downtime in which submitted tickets became stuck in processing. We identified the issue after noticing a backlog in the ticket queue. To resolve it, we rescaled our job handler to run more jobs in parallel, which cleared the backlog and restored normal processing.

**Root Cause**

A large number of requests overwhelmed the job handler, which could not keep up with the increased load. In addition, the ticket processing jobs shared a queue with other jobs, which prevented them from executing in a timely manner. Together, these led to a buildup in the queue that delayed ticket processing.

**Resolution**

Once the issue was identified, we rescaled our job handler to allow more parallel job executions. This sped up processing of the stuck jobs and cleared the queue backlog.

**Action Items**

1. **Separate the ticket processing job into a new queue** - Done
   * This prevents ticket processing from being impacted by other jobs in the future.
2. **Extend the lock duration for the ‘give access’ job** - Done
   * Ensures better handling of multiple jobs running in parallel without conflicts.
3. **Create specific monitoring for the new ticket queue** - Done
   * Improves visibility and allows for proactive detection of similar issues in the future.

**Lessons Learned**

* Monitoring gaps delayed detection of the issue.
* Job queue separation is critical for isolating failures and ensuring reliability.
* Scaling parallel processing dynamically can be an effective quick fix, but it needs to be complemented with structural changes.

**Next Steps**

* Continuously monitor the new ticket queue to ensure stability.
* Evaluate job locking mechanisms for other critical processes.
* Implement automated alerts for job queue delays.
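As an illustration of the "automated alerts for job queue delays" item, the following is a minimal sketch of a queue-delay check. It is not Entitle's implementation: the `QueuedJob` type, the function names, and the thresholds (5 minutes, 100 jobs) are all hypothetical and chosen only to show the idea of alerting on either queue depth or the age of the oldest pending job.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class QueuedJob:
    """A pending job. Both fields are hypothetical, for illustration only."""
    job_id: str
    enqueued_at: datetime


def oldest_job_age(jobs: Sequence[QueuedJob], now: datetime) -> Optional[timedelta]:
    """Return how long the oldest pending job has waited, or None if the queue is empty."""
    if not jobs:
        return None
    return now - min(job.enqueued_at for job in jobs)


def queue_is_delayed(
    jobs: Sequence[QueuedJob],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # assumed threshold
    max_depth: int = 100,                       # assumed threshold
) -> bool:
    """True when the queue holds more than max_depth jobs or its oldest job exceeds max_age."""
    if len(jobs) > max_depth:
        return True
    age = oldest_job_age(jobs, now)
    return age is not None and age > max_age


if __name__ == "__main__":
    now = datetime(2030, 1, 1, 12, 0, tzinfo=timezone.utc)
    backlog = [QueuedJob("ticket-1", now - timedelta(minutes=10))]
    if queue_is_delayed(backlog, now):
        print("ALERT: ticket queue is delayed")
```

Checking the age of the oldest job, in addition to queue depth, is intended to catch the case described in the root cause: a small number of ticket jobs waiting a long time behind unrelated work would not necessarily trip a depth-only alert.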