Mindtickle incident

Mindtickle Admin and Learning Site Unavailable for Select Users

Mindtickle experienced a minor incident on September 20, 2025. The incident has been resolved; the full update timeline is below.

Started: Sep 20, 2025, 10:29 AM UTC
Resolved: Sep 20, 2025, 06:40 AM UTC
Duration: —
Detected by Pingoru: Sep 20, 2025, 10:29 AM UTC

Update timeline

resolved Sep 20, 2025, 10:29 AM UTC

Between 19 Sep 2025, 23:40 and 20 Sep 2025, 00:50 PT, the Mindtickle admin and learning sites were unavailable for select users in our US Production region. The platform has recovered and is now fully functional. We will share an RCA after a detailed postmortem. We apologize for the inconvenience caused.

postmortem Sep 24, 2025, 04:52 AM UTC

## Incident Summary

On September 19, 2025, during a periodic platform upgrade, the Mindtickle platform experienced an outage lasting approximately 1 hour and 43 minutes. The disruption was caused by a configuration issue on upgraded servers, which led to resource constraints. As a result, application services could not run as expected, causing downtime for a set of customers (in the US region). Our engineering team identified the issue, corrected the configuration, and rotated the affected servers. Services were then fully restored and stabilized.

* Start time: September 19, 2025, 11:06 PM PT
* End time: September 20, 2025, 12:49 AM PT

## Impact

* Workflows Impacted: All workflows on the platform
* Customers Impacted: A select set of customers (in the US region)

## Incident Timeline (PT)

* September 19, 2025, 11:06 PM: Users began experiencing downtime and errors
* September 19, 2025, 11:10 PM: Engineering team detected the issue and initiated an investigation
* September 19, 2025, 11:45 PM: Root cause identified (server resource constraints); remediation started
* September 20, 2025, 12:30 AM: Services began recovering as corrected configurations were applied
* September 20, 2025, 12:49 AM: All services confirmed healthy; incident resolved

## Root Cause

The outage was caused by misconfigured disk sizes on newly upgraded servers. This resulted in resource shortages that prevented application services from running. The misconfiguration was not detected during pre-upgrade validation because the upgrade scripts did not fully account for updated server requirements.

## Preventive Actions

To prevent recurrence, we are implementing the following measures:

1. Configuration Management: Standardize and validate server configurations across environments.
2. Upgrade Safeguards: Introduce a staggered approach with cooldown periods between server pool rotations.
3. Runbook Enhancements: Update documentation with environment-specific requirements and lessons learned.
4. Proactive Monitoring: Enhance alerts to detect early signs of resource constraints.

We sincerely apologize for the disruption this outage caused. We are committed to learning from this incident and strengthening our upgrade and validation processes to ensure greater reliability and resilience of the Mindtickle platform.
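
As an illustration of the first two preventive actions, the sketch below shows one possible shape for a pre-upgrade disk check combined with a staggered rotation plan. It is not Mindtickle's tooling: `ServerSpec`, `find_undersized`, `plan_rotation`, the pool names, the 100 GB minimum, and the 30-minute cooldown are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Hypothetical description of one upgraded server."""
    name: str
    pool: str
    disk_gb: int


def find_undersized(servers, min_disk_gb):
    """Return the servers whose disk is smaller than the required minimum."""
    return [s for s in servers if s.disk_gb < min_disk_gb]


def plan_rotation(servers, min_disk_gb, cooldown_minutes):
    """Plan a staggered rotation: one pool at a time, with a cooldown between pools.

    Raises ValueError before anything is planned if a server fails validation,
    so a misconfigured disk size blocks the upgrade instead of reaching production.
    """
    undersized = find_undersized(servers, min_disk_gb)
    if undersized:
        names = ", ".join(s.name for s in undersized)
        raise ValueError(f"disk below {min_disk_gb} GB on: {names}")

    pools = sorted({s.pool for s in servers})
    # Each step is (pool, minutes to wait after the rollout starts).
    return [(pool, i * cooldown_minutes) for i, pool in enumerate(pools)]


if __name__ == "__main__":
    # Assumed example fleet; values are illustrative only.
    fleet = [
        ServerSpec("app-1", "pool-a", 200),
        ServerSpec("app-2", "pool-b", 150),
    ]
    for pool, offset in plan_rotation(fleet, min_disk_gb=100, cooldown_minutes=30):
        print(f"rotate {pool} at +{offset} min")
```

Validating the whole fleet before planning any rotation means a single undersized server stops the rollout up front, while the per-pool offsets leave time to observe each pool before the next one is rotated.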