SchooLinks experienced a critical incident on October 2, 2024 affecting SchooLinks Web App, lasting 41m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Oct 02, 2024, 02:58 PM UTC
We are currently experiencing high load and performance issues on some modules within the application.
- monitoring Oct 02, 2024, 03:17 PM UTC
A fix has been deployed and we're monitoring the performance of the application.
- resolved Oct 02, 2024, 03:40 PM UTC
This incident has been resolved. Performance is stable across the application.
- postmortem Oct 02, 2024, 04:16 PM UTC
### SchooLinks Platform Outage

**Incident Summary:**

On Oct 2, 2024 at 9:37am CDT, SchooLinks experienced a platform outage affecting a significant number of users. The outage was caused by a surge in user activity that led to an excessive number of database connections, ultimately driving our production database to high CPU utilization. The platform became unresponsive, and users encountered errors when trying to access the service. The initial outage lasted for approximately 20 minutes before service was restored, but some users experienced intermittent 500 errors until 10:32am CDT.

**Impact:**

* **Affected Services**: Entire SchooLinks platform
* **Error Rate**: Increased 500 errors during the outage period
* **User Impact**: Users were unable to access or perform key actions on the platform during the outage
* **Total Duration**: 52 minutes

**Root Cause:**

The root cause of the outage was a misconfiguration in how our system handles database connections under high demand. A sudden increase in platform usage created a high volume of requests to our database, which exceeded its connection and processing capacity. As a result, the writer instance of the production database became locked.

**Resolution:**

SchooLinks Engineering vertically scaled the writer instance of our production database to handle the surge in load. Once the new writer instance came online, there was an uptick in HTTP 500 errors. After investigation, we determined that the errors were caused by active connections that were still pointing to the old database writer. These stale connections were unable to route requests correctly. Redeploying our backend API forced connections to be routed to the new writer instance, after which normal operation returned and the 500 errors subsided.

**Timeline of Events:**

* **9:37 AM CDT**: Sudden increase in user activity leading to high traffic on the database.
* **9:37 AM CDT**: Database locked, causing platform unresponsiveness.
* **9:40 AM CDT**: Investigation revealed the production database was overwhelmed by open connections.
* **9:57 AM CDT**: Vertical scaling of the writer instance initiated to restore service.
* **9:58 AM CDT**: New database writer came online, but 500 errors were observed.
* **10:10 AM CDT**: Forced redeployment of the backend API initiated to route connections to the new database writer.
* **10:32 AM CDT**: Resolution of stale connections, restoring full platform functionality.

**Conclusion:**

We deeply apologize for the inconvenience this outage caused our users. The reliability and performance of the SchooLinks platform are of utmost importance, and we are taking immediate steps to enhance our infrastructure to prevent similar incidents in the future. These steps include hardening our database connection management strategy and reviewing our failover handling so that both work seamlessly when we scale our services during periods of high demand.

We appreciate your patience and understanding, and we are committed to continuously improving the platform to provide a smooth and seamless experience for all users. If you have any further questions or concerns, please do not hesitate to reach out to our support team.

**Sincerely,**

SchooLinks Engineering
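**Illustrative sketch (not SchooLinks' implementation):**

The postmortem describes two connection problems: too many open database connections under load, and stale connections that kept pointing at the old writer after it was replaced. One common way to address both is a pool that caps open connections and retires any connection older than a maximum age. The Python sketch below shows that idea under stated assumptions. The names `BoundedPool`, `PoolExhausted`, `max_size`, and `max_age_s` are hypothetical, and the "connection" is just a string standing in for a real database handle.

```python
import threading
import time
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque


class PoolExhausted(RuntimeError):
    """Raised when every connection slot is already checked out."""


@dataclass
class Pooled:
    conn: Any          # stand-in for a real database connection
    created_at: float  # clock reading when the connection was opened


class BoundedPool:
    """Hypothetical pool: caps open connections and recycles old ones."""

    def __init__(self, connect: Callable[[], Any], max_size: int,
                 max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._connect = connect
        self._max_size = max_size
        self._max_age_s = max_age_s
        self._clock = clock
        self._idle: Deque[Pooled] = deque()
        self._checked_out = 0
        self._lock = threading.Lock()

    def acquire(self) -> Pooled:
        with self._lock:
            now = self._clock()
            # Reuse an idle connection only if it is younger than max_age_s.
            while self._idle:
                item = self._idle.popleft()
                if now - item.created_at < self._max_age_s:
                    self._checked_out += 1
                    return item
                # Too old: discard it (real code would also close it).
            # Fail fast instead of opening more connections than the cap.
            if self._checked_out >= self._max_size:
                raise PoolExhausted("all connection slots are in use")
            self._checked_out += 1
            return Pooled(conn=self._connect(), created_at=now)

    def release(self, item: Pooled) -> None:
        with self._lock:
            self._checked_out -= 1
            self._idle.append(item)


def demo() -> list:
    """Simulate a writer replacement with a fake clock and endpoint."""
    now = [0.0]
    endpoint = {"host": "writer-old"}  # assumed endpoint lookup
    pool = BoundedPool(connect=lambda: endpoint["host"], max_size=2,
                       max_age_s=30.0, clock=lambda: now[0])
    first = pool.acquire()
    pool.release(first)
    endpoint["host"] = "writer-new"  # the writer is replaced
    now[0] = 60.0                    # time passes beyond max_age_s
    second = pool.acquire()          # stale connection is dropped, new one opened
    return [first.conn, second.conn]
```

In this sketch, a connection opened before the writer was replaced is discarded once it exceeds `max_age_s`, so the next request reconnects to the current writer without a redeployment. Production pooling libraries usually offer equivalent settings, often alongside a liveness check before each reuse.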