CodeTwo incident

[North Europe] Email processing delays due to Microsoft power issues

CodeTwo experienced a minor incident on March 23, 2020 affecting Mail flow, lasting 33m. The incident has been resolved; the full update timeline is below.

Started: Mar 23, 2020, 04:11 PM UTC
Resolved: Mar 23, 2020, 04:45 PM UTC
Duration: 33m
Detected by Pingoru: Mar 23, 2020, 04:11 PM UTC

Affected components

Mail flow

Update timeline

investigating Mar 23, 2020, 04:11 PM UTC

We are currently investigating this issue.
investigating Mar 23, 2020, 04:23 PM UTC

We are continuing to investigate the issue with network quality within Microsoft datacenters in North Europe. A subset of users may experience delayed email delivery. Signatures are added correctly.
identified Mar 23, 2020, 04:29 PM UTC

The issue has been identified and a fix is being implemented.
monitoring Mar 23, 2020, 04:36 PM UTC

The issue has been mitigated. All emails are now delivered with no delays. We're actively monitoring the services to confirm that everything is working correctly. We're also working with the Microsoft Premier Support Team to find out what caused the problem.
resolved Mar 23, 2020, 04:45 PM UTC

The incident has been resolved. An RCA will be provided later. Please accept our apologies for the problem.
postmortem Mar 26, 2020, 11:09 AM UTC

Email delivery delays that affected a subset of users in North Europe were caused by power issues in the Microsoft datacenter in Dublin. Even though the servers where CodeTwo services are hosted were not directly impacted, most of our clusters were indirectly affected by the outage, as connectivity to all the clusters in this region was experiencing stress or downtime. Our high-availability secondary services in this region were partially impacted as well, which led to email processing delays for some tenants and created bottlenecks in the mail transport pipeline before our failover systems hosted on unaffected nodes kicked in to mitigate the problem for affected users. Our failover services mitigated the problem completely within minutes. When the entire datacenter was fully operational, we switched back to primary services.

For more information, please read the RCA provided by Microsoft:

_**Incident Summary:**_ _Between 15:40 and 16:20 UTC on 23 Mar 2020, a subset of customers in North Europe may have seen errors connecting to resources hosted in this region._

_**Root cause:**_ _During an electrical switching procedure that was being performed on a construction site that shares utility power with one of our operational datacenters, an incorrect process was followed. Due to this improper switching, a large voltage sag was seen by our operational datacenter. While there was no loss of power to server racks, the event led to a subset of servers within a single storage scale unit to experience a reboot event. The rebooting of the various servers led to some of the region’s Storage subscriptions and their associated Azure services to be unreachable while the systems recovered._

_**Mitigation:**_ _As this was a transient power sag event, the Storage servers were allowed to automatically recover._

_We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):_

* _Evaluate server hardware to determine the cause of rebooting._
* _Partner with the construction company to ensure that they understand the impact they caused and they take steps to ensure that all electrical work on the shared utility service follows correct procedures._

_We apologize for any inconvenience this may have caused._