Mindtickle experienced a major incident on November 20, 2023 affecting Course / Quick-Update / Assessment and Mission and 1 more component, lasting 10m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Nov 20, 2023, 01:01 PM UTC
The admin and learning instances of the platform are not accessible. We are currently investigating the issue.
- identified Nov 20, 2023, 01:04 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Nov 20, 2023, 01:08 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Nov 20, 2023, 01:12 PM UTC
The incident has been resolved. The admin and learning instances of the Mindtickle platform are now accessible.
- postmortem Dec 04, 2023, 03:05 PM UTC
**What Happened?**
Login to the admin and learning sites of the Mindtickle platform was impacted from Nov 20th 2023, 04:40 am PT to Nov 20th 2023, 05:08 am PT.

**Root Cause:**
One of the queries did not execute properly and ended up running in a loop. This caused a sudden spike in CPU utilization within a short period of time, which made the database node unresponsive. The node could not execute a graceful failover, so requests to the node kept increasing and eventually failed.

We received an alert during the spike in utilization, and the team had already started investigating the issue. Once we identified the issue with the specific node, we immediately removed it so that a new node could come up, and we also terminated the long-running query. This freed up CPU, and requests started processing normally.

**Timeline of events:**
* Nov 20th 2023, 04:40 am PT - Login to the admin and learning sites of the Mindtickle platform was impacted.
* Nov 20th 2023, 04:43 am PT - Multiple pagers were triggered because the node was unable to execute a graceful failover, which led to all requests failing.
* Nov 20th 2023, 04:45 am PT - The impacted node, which was not responding, was identified and a manual removal was initiated.
* Nov 20th 2023, 04:50 am PT - The new node was available.
* Nov 20th 2023, 04:55 am - 05:05 am PT - The new node was added back, and all the requests handled by this node started processing successfully.
* Nov 20th 2023, 05:08 am PT - The system was back to normal and traffic was restored as usual.

**Learning and Next Steps:**
* We are revisiting the failover process for all the key components on the Mindtickle platform.
* We are also revisiting the query timeouts to ensure long queries do not result in a spike in utilization.
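
As an illustration of the query-timeout item above, here is a minimal sketch of a per-query time budget. The postmortem does not name the database involved, so this example assumes SQLite (via Python's standard `sqlite3` module) purely as a stand-in. `run_with_timeout`, `QueryTimeout`, and `RUNAWAY_SQL` are hypothetical names introduced for the example, and an unbounded recursive query plays the role of the query that ran in a loop.

```python
import sqlite3
import time


class QueryTimeout(Exception):
    """Raised when a statement exceeds its time budget (hypothetical name)."""


def run_with_timeout(conn, sql, timeout_s, params=()):
    """Run sql and abort it if it runs longer than timeout_s seconds."""
    deadline = time.monotonic() + timeout_s

    def over_budget():
        # A non-zero return value tells SQLite to abort the running statement.
        return 1 if time.monotonic() > deadline else 0

    # Invoke over_budget roughly every 10,000 SQLite VM instructions.
    conn.set_progress_handler(over_budget, 10_000)
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.OperationalError as exc:
        if time.monotonic() > deadline:
            raise QueryTimeout(f"query aborted after {timeout_s}s") from exc
        raise
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler again


# Assumption: an unbounded recursive query stands in for the runaway query.
RUNAWAY_SQL = """
WITH RECURSIVE counter(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM counter
)
SELECT count(*) FROM counter
"""

conn = sqlite3.connect(":memory:")

# A well-behaved query finishes inside its budget.
print(run_with_timeout(conn, "SELECT 1 + 1", 1.0))

# The runaway query is cut off instead of holding the CPU indefinitely.
try:
    run_with_timeout(conn, RUNAWAY_SQL, 0.2)
except QueryTimeout as exc:
    print(exc)
```

Server databases usually offer an equivalent limit enforced on the server side (PostgreSQL's `statement_timeout` setting is one example), which has the advantage of bounding a query even when the client that issued it has stopped responding.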