SuperOffice experienced a major incident on May 21, 2024, lasting 51m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating May 21, 2024, 01:25 PM UTC
We are currently investigating this issue.
- monitoring May 21, 2024, 01:35 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved May 21, 2024, 02:17 PM UTC
This incident has been resolved.
- postmortem May 29, 2024, 02:51 PM UTC
**Date:** May 21st
**Start Time:** 2:36 PM
**End Time:** 3:30 PM
**Impact:** Webserver performance degradation, Authentication service instability, and user login problems.

## Summary

After the Antivirus software upgrade on May the 15th, we observed a marked degradation in the performance of our webservers. This was characterized by increased load times and slower response rates, which directly impacted user experience.

### Immediate Response

In response to the emerging performance concerns, a decision was made to downgrade all Antivirus agents on the affected webservers. The goal was to revert to the previous stable version to restore system performance.

### Consequences of the Decision

The downgrading process itself imposed a substantial load on the webservers. Additionally, the Authentication service, responsible for authentication and authorization, suffered a significant performance hit. This, in turn, caused instability within the service and resulted in widespread login problems for our users.

## Root Cause Analysis

* **Antivirus Upgrade:** The initial upgrade introduced unexpected resource consumption that was not anticipated during the pre-deployment testing phase.
* **Downgrade Process:** The simultaneous downgrading of all agents created a bottleneck, as the servers were already under stress from the upgrade's performance impact.
* **Authentication Service Overload:** The compounded load from both the upgrade and the downgrade overwhelmed the authentication service, leading to its instability.

## Resolution and Recovery

* **Service Stabilization:** We prioritized stabilizing the authentication service to mitigate login issues and restore user access.
* **Performance Monitoring:** Enhanced monitoring was put in place to closely observe the webservers' performance and ensure system stability.

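
The root cause analysis notes that downgrading every agent at once created a bottleneck. As an illustration only, and not a description of the tooling actually used here, a staged rollout that changes a few servers at a time and stops when a batch looks unhealthy could be sketched in Python as follows. The names `rolling_change`, `apply_change`, `is_healthy`, and `batch_size` are hypothetical.

```python
from typing import Callable, List


def rolling_change(
    servers: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 2,
) -> List[str]:
    """Apply a change batch by batch, halting after the first unhealthy batch.

    Returns the servers that were changed before the rollout stopped.
    All callables are placeholders for whatever tooling is really in use.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    changed: List[str] = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            apply_change(server)
            changed.append(server)
        # Do not touch the next batch if any server in this one is struggling.
        if not all(is_healthy(server) for server in batch):
            break
    return changed
```

With a small `batch_size`, only a fraction of the webservers carries the extra load of the change at any moment, and a failing health check limits the impact to the batches already processed.
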
**Corrective Measures:** To prevent future occurrences of this nature, we are:

* Implementing additional monitoring alerts for early detection of abnormal load patterns.
* Reviewing our change management procedures to ensure better handling of critical infrastructure updates.
* Conducting a thorough investigation to understand the interdependencies between service components during maintenance tasks.

We apologize for any inconvenience caused and appreciate your understanding as we continuously strive to improve our services.