SuperOffice incident

Connection issue in Online

Major, Resolved

SuperOffice experienced a major incident on May 21, 2024, lasting 51m. The incident has been resolved; the full update timeline is below.

Started
May 21, 2024, 01:25 PM UTC
Resolved
May 21, 2024, 02:17 PM UTC
Duration
51m
Detected by Pingoru
May 21, 2024, 01:25 PM UTC

Update timeline

  1. investigating May 21, 2024, 01:25 PM UTC

    We are currently investigating this issue.

  2. monitoring May 21, 2024, 01:35 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved May 21, 2024, 02:17 PM UTC

    This incident has been resolved.

  4. postmortem May 29, 2024, 02:51 PM UTC

    **Date:** May 21st
    **Start Time:** 2:36 PM
    **End Time:** 3:30 PM
    **Impact:** Webserver performance degradation, Authentication service instability, and user login problems.

    ## Summary

    After the Antivirus software upgrade on May the 15th, we observed a marked degradation in the performance of our webservers. This was characterized by increased load times and slower response rates, which directly impacted user experience.

    ### Immediate Response

    In response to the emerging performance concerns, a decision was made to downgrade all Antivirus agents on the affected webservers. The goal was to revert to the previous stable version to restore system performance.

    ### Consequences of the Decision

    The downgrading process itself imposed a substantial load on the webservers. Additionally, the Authentication service, responsible for authentication and authorization, suffered a significant performance hit. This, in turn, caused instability within the service and resulted in widespread login problems for our users.

    ## Root Cause Analysis

    * **Antivirus Upgrade:** The initial upgrade introduced unexpected resource consumption that was not anticipated during the pre-deployment testing phase.
    * **Downgrade Process:** The simultaneous downgrading of all agents created a bottleneck, as the servers were already under stress from the upgrade's performance impact.
    * **Authentication Service Overload:** The compounded load from both the upgrade and the downgrade overwhelmed the authentication service, leading to its instability.

    ## Resolution and Recovery

    * **Service Stabilization:** We prioritized stabilizing the authentication service to mitigate login issues and restore user access.
    * **Performance Monitoring:** Enhanced monitoring was put in place to closely observe the webservers' performance and ensure system stability.

    **Corrective Measures:** To prevent future occurrences of this nature, we are:

    * Implementing additional monitoring alerts for early detection of abnormal load patterns.
    * Reviewing our change management procedures to ensure better handling of critical infrastructure updates.
    * Conducting a thorough investigation to understand the interdependencies between service components during maintenance tasks.

    We apologize for any inconvenience caused and appreciate your understanding as we continuously strive to improve our services.
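
The root cause analysis above attributes the bottleneck to downgrading all agents at the same time. One common way to reduce that risk is a staged rollout that changes a few servers per step and stops when load rises. The sketch below only illustrates that idea and is not SuperOffice's tooling or process: `rollout_in_batches`, `apply_change`, `load_of`, the batch size, and the load threshold are all hypothetical names and values chosen for the example.

```python
from typing import Callable, List


def rollout_in_batches(
    servers: List[str],
    apply_change: Callable[[str], None],  # hypothetical: performs the change on one server
    load_of: Callable[[str], float],      # hypothetical: returns current load as 0.0..1.0
    batch_size: int = 2,
    max_load: float = 0.8,                # assumed threshold, not a vendor value
) -> List[str]:
    """Apply a change a few servers at a time and halt if load gets too high.

    Returns the servers that were changed, in order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    changed: List[str] = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            apply_change(server)
            changed.append(server)
        # Health gate: inspect the servers just changed before touching more.
        if any(load_of(server) > max_load for server in batch):
            break
    return changed
```

In this sketch the rollout stops after the first batch that contains an overloaded server, so the remaining servers stay untouched and keep serving traffic. A real procedure would likely also wait for metrics to settle between batches and check dependent services, such as authentication, before continuing.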