SuperOffice incident

Performance issue in login service. Some users are unable to connect.

SuperOffice experienced a major incident on May 14, 2024, lasting 34m. The incident has been resolved; the full update timeline is below.

Started: May 14, 2024, 07:41 AM UTC
Resolved: May 14, 2024, 08:15 AM UTC
Duration: 34m
Detected by Pingoru: May 14, 2024, 07:41 AM UTC

Update timeline

investigating May 14, 2024, 07:41 AM UTC

We are currently investigating this issue.

identified May 14, 2024, 07:49 AM UTC

The issue has been identified and a fix is being implemented.

monitoring May 14, 2024, 07:58 AM UTC

A fix has been implemented and we are monitoring the results.

resolved May 14, 2024, 08:15 AM UTC

This incident has been resolved.

postmortem May 29, 2024, 01:33 PM UTC

**Date:** May 14th
**Start Time:** 09:15 AM
**End Time:** 10:40 AM
**Impact:** Webserver performance degradation, authentication service instability, and user login problems.

## Summary

On May 14th, an update to our security infrastructure, the renewal of an SSL certificate on the front-end load balancer, inadvertently triggered a rebalancing process across the backend load balancers. This unexpected behavior put an excessive load on our authentication service, rendering it unresponsive and preventing all users from accessing their work.

**Timeline of Events:**

* **Renewal of SSL Certificate:** The certificate was successfully renewed on the front-end load balancer.
* **Unintended Consequences:** Following the renewal, the backend load balancer initiated a rebalance of customer loads.
* **Authentication Service Overload:** The rebalance placed a heavy load on the authentication service, leaving users unable to perform any operations system-wide.
* **Resolution:** The issue was resolved by restarting the authentication service.
* **Service Restoration:** Normal service functionality was restored at approximately 10:50 AM.

**Corrective Measures:**

To prevent future occurrences of this nature, we are:

* Implementing additional monitoring alerts for early detection of abnormal load patterns.
* Reviewing the change-management procedures to ensure better handling of critical infrastructure updates.
* Conducting a thorough investigation to understand the interdependencies between service components during maintenance tasks.

We apologize for any inconvenience caused and appreciate your understanding as we continuously strive to improve our services.
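
As an illustration of the first corrective measure (alerts for abnormal load patterns), the following is a minimal sketch only, not the monitoring SuperOffice actually runs. It assumes the authentication service exposes a requests-per-second figure that can be sampled periodically. The class name `LoadSpikeDetector`, the window size, and the thresholds are all hypothetical choices made for the example.

```python
from collections import deque
from statistics import mean, stdev


class LoadSpikeDetector:
    """Flags a load sample that sits far above the recent rolling baseline.

    Hypothetical helper for illustration; all parameters are assumptions.
    """

    def __init__(self, window=30, threshold_sigmas=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # recent "normal" samples
        self.threshold_sigmas = threshold_sigmas
        self.min_samples = min_samples

    def observe(self, requests_per_second):
        """Record a sample and return True if it looks abnormal."""
        abnormal = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = stdev(self.samples)
            # Use a floor on the spread so a very flat baseline does not
            # turn tiny fluctuations into alerts.
            margin = max(spread, 0.05 * baseline, 1.0)
            abnormal = requests_per_second > baseline + self.threshold_sigmas * margin
        # Only normal samples update the baseline, so a sustained spike
        # does not quickly become the new "normal".
        if not abnormal:
            self.samples.append(requests_per_second)
        return abnormal


detector = LoadSpikeDetector(window=10, min_samples=5)
steady = [detector.observe(v) for v in [100, 102, 98, 101, 99, 103]]
spike = detector.observe(400)  # e.g. a sudden surge after a rebalance
```

In this sketch the steady samples raise no alert, while the sudden jump to 400 requests per second is flagged. A real deployment would more likely express the same idea as a rule in an existing monitoring system than as custom code.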