Mindtickle incident

High error rates observed on the Mindtickle platform

Severity: Minor · Status: Resolved

Mindtickle experienced a minor incident on August 29, 2025, lasting 28 minutes. The incident has been resolved; the full update timeline is below.

Started
Aug 29, 2025, 07:49 PM UTC
Resolved
Aug 29, 2025, 08:17 PM UTC
Duration
28 minutes
Detected by Pingoru
Sep 03, 2025, 04:58 AM UTC

Update timeline

  1. resolved Sep 03, 2025, 04:58 AM UTC

    On August 29, 2025, between 12:49 PM and 1:17 PM PT, the Mindtickle platform experienced high error rates. During this period, multiple users may have encountered issues logging into the platform or accessing programs. The incident was resolved at 1:17 PM PT, and services have since been fully restored. We are conducting a detailed analysis of the incident and will share a Root Cause Analysis (RCA) once it is complete. We apologize for any disruption this may have caused and thank you for your patience.

  2. postmortem Sep 24, 2025, 04:44 AM UTC

    # **Incident Summary**

    On August 29, 2025, the Mindtickle platform experienced a temporary disruption where some users were unable to log in or access programs. The issue was identified and resolved within 28 minutes, restoring the platform to normal operation.

    * Start time: August 29, 2025, 12:49 PM PT
    * End time: August 29, 2025, 01:17 PM PT

    # **Impact Area**

    The following functionality was impacted during the incident:

    * User logins
    * Access to programs/assets (assigned series, modules, and assets were impacted)

    # **Incident Timeline**

    * **August 29, 2025, 12:49 PM PT:** Users began experiencing login and program access errors.
    * **August 29, 2025, 12:55 PM PT:** The Engineering team detected elevated error rates and initiated an investigation.
    * **August 29, 2025, 01:17 PM PT:** Corrective actions were applied; services were restored to a stable state.

    # **Root Cause Analysis**

    The disruption was caused by system resource exhaustion in one database cluster, which led to request timeouts and high error rates for affected services. Once identified, the engineering team stabilized services by resetting resource pools and prioritizing critical traffic.

    # **Next Steps and Preventive Actions**

    * **System Safeguards:** Implement circuit breakers to isolate and recover from failures faster (a minimal sketch follows below).
    * **Resiliency Improvements:** Maintain priority channels for critical operations to reduce customer impact in similar scenarios (see the second sketch below).
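For readers unfamiliar with the "circuit breakers" named under System Safeguards, here is a minimal sketch of the pattern in Python. It is illustrative only, not Mindtickle's implementation; the thresholds, timeout, and wrapped function are assumptions.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures, stop calling the
    dependency and fail fast, instead of letting requests pile up as timeouts."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to wait before a probe call
        self.failure_count = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: reject immediately without touching the unhealthy dependency.
                raise RuntimeError("circuit open: dependency presumed unhealthy")
            # Reset window elapsed: go half-open, allowing a single probe call.
            self.opened_at = None
            self.failure_count = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failure_count = 0  # any success fully closes the circuit
        return result

# Hypothetical usage: wrap queries to the database cluster named in the RCA.
# breaker = CircuitBreaker()
# user = breaker.call(query_user_db, user_id)  # query_user_db is a stand-in name
```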
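Similarly, the "priority channels" item can be read as keeping a separate fast lane for critical operations such as logins. A toy sketch under that assumption (the queue names and the blocking fallback are simplifications, not the vendor's design):

```python
import queue

# Two channels: critical operations (e.g., logins) are always drained first,
# so they keep flowing even when bulk traffic saturates the workers.
critical_q = queue.Queue()
bulk_q = queue.Queue()

def submit(task, is_critical=False):
    """Route a task to the critical or bulk channel."""
    (critical_q if is_critical else bulk_q).put(task)

def next_task():
    """Serve critical work first; fall back to bulk only when none is waiting."""
    try:
        return critical_q.get_nowait()
    except queue.Empty:
        return bulk_q.get()  # simplification: blocks until a bulk task arrives
```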