SuperOffice incident

Network disruption

Major, Resolved

SuperOffice experienced a major incident on May 15, 2024, lasting 2h 32m. The incident has been resolved; the full update timeline is below.

Started
May 15, 2024, 08:54 AM UTC
Resolved
May 15, 2024, 11:27 AM UTC
Duration
2h 32m
Detected by Pingoru
May 15, 2024, 08:54 AM UTC

Update timeline

  1. investigating May 15, 2024, 08:54 AM UTC

    We are currently investigating a network disruption that is causing intermittent availability issues with SuperOffice CRM Cloud.

  2. monitoring May 15, 2024, 09:02 AM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved May 15, 2024, 11:27 AM UTC

    This incident has been resolved.

  4. postmortem May 29, 2024, 02:48 PM UTC

    **Date:** May 15th
    **Start Time:** 10:40 AM
    **End Time:** 11:20 AM
    **Impact:** Webserver performance degradation, Authentication service instability, and user login problems.

    ## Summary

    On the morning of May 15th, an upgrade to Antivirus software was initiated across all servers at 10:30 AM. This routine maintenance task unexpectedly resulted in a high load on all services. The most significant impact was observed on our authentication services, which experienced such heavy load that it led to a noticeable slowness in all user login attempts.

    ## Timeline

    * **10:30 AM:** Antivirus software upgrade commenced.
    * **10:42 AM:** Increased load on services detected.
    * **10:45 AM:** Authentication services began showing signs of slowness.
    * **11:20 AM:** Services restored.

    ## Root Cause Analysis

    The root cause of the incident was identified as the simultaneous upgrade of Antivirus software on all servers, which created an unexpected surge in resource consumption. This surge exceeded the anticipated load and was not accounted for in our capacity planning. The authentication services, being critical to user access, were hit hardest due to their vital role in the system's operation.

    ## Resolution and Recovery

    Upon identifying the issue, the response team took immediate action to mitigate the impact:

    1. Prioritized resources for authentication services to alleviate the load.
    2. Monitored the system closely until all services stabilized.

    By 11:20 AM, the system had normalized, and all services were fully operational.

    **Corrective Measures:** To prevent future occurrences of this nature, we are:

    * Implementing additional monitoring alerts for early detection of abnormal load patterns.
    * Reviewing our change management procedures to ensure better handling of critical infrastructure updates.
    * Conducting a thorough investigation to understand the interdependencies between service components during maintenance tasks.

We apologize for any inconvenience caused and appreciate your understanding as we continuously strive to improve our services.
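
The postmortem lists "monitoring alerts for early detection of abnormal load patterns" as a corrective measure but does not say how such alerts would work. As a rough illustration only, one common approach is to compare each new load sample against a rolling baseline. The sketch below is not SuperOffice's implementation: the `LoadAlert` class, its parameters, and the sample readings (imagined here as CPU percentages) are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class LoadAlert:
    """Hypothetical detector: flags load samples far above a rolling baseline."""

    def __init__(self, window=30, threshold_sigmas=3.0, min_samples=5):
        # Assumed defaults; real values would come from capacity planning.
        self.samples = deque(maxlen=window)
        self.threshold_sigmas = threshold_sigmas
        self.min_samples = min_samples

    def observe(self, load):
        """Record one sample and return True if it is abnormal for the baseline."""
        abnormal = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            # Floor the spread so a very flat baseline does not make
            # every small fluctuation look abnormal.
            spread = max(stdev(self.samples), 0.05 * baseline, 1e-9)
            abnormal = load > baseline + self.threshold_sigmas * spread
        # Only normal samples update the baseline, so a sustained surge
        # keeps alerting instead of becoming the new normal.
        if not abnormal:
            self.samples.append(load)
        return abnormal


# Made-up readings: steady load, then a surge like the one described above.
alert = LoadAlert()
readings = [20, 22, 21, 19, 20, 21, 85, 90]
flags = [alert.observe(r) for r in readings]
print(flags)  # → [False, False, False, False, False, False, True, True]
```

Keeping surge samples out of the baseline is a deliberate choice in this sketch: during a sustained event such as a fleet-wide upgrade, the alert stays raised until load actually returns to its earlier level.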