CodeTwo incident

[Australia] Email processing delays

Minor Resolved

CodeTwo experienced a minor incident on June 27, 2024 affecting Mail flow, lasting 1h 6m. The incident has been resolved; the full update timeline is below.

Started
Jun 27, 2024, 02:09 AM UTC
Resolved
Jun 27, 2024, 03:15 AM UTC
Duration
1h 6m
Detected by Pingoru
Jun 27, 2024, 02:09 AM UTC

Affected components

Mail flow

Update timeline

  1. investigating Jun 27, 2024, 02:09 AM UTC

    We are currently investigating email processing issues in Australia. A subset of users may experience delayed email delivery. Email signatures are added normally. The next update will be provided in 30 minutes or as events warrant.

  2. identified Jun 27, 2024, 02:17 AM UTC

    Microsoft Exchange Online Protection’s performance appears to be degraded at the moment, which means it is unable to process messages sent from CodeTwo and other vendors in a timely manner. We have notified Microsoft Support about the problem. We are, however, seeing signs of recovery, which suggests the problem should be mitigated soon.

  3. identified Jun 27, 2024, 02:28 AM UTC

    Microsoft has just published a status update on X about a major Microsoft 365 outage: https://x.com/msft365status/status/1806149130663649355?s=46. The outage is also listed in the Microsoft 365 Admin Center (Issue ID: MO805755). This is a Microsoft issue. Please keep monitoring Microsoft’s communications to stay up to date with the mitigation steps during this outage.

  4. monitoring Jun 27, 2024, 02:56 AM UTC

    We can see the situation has improved significantly. The queues are almost gone. Microsoft has just reported: “We determined that a recent change within Azure networking infrastructure led to impact. We reverted this change and we're monitoring our telemetry to ensure that affected services recover as expected.” We will continue to actively monitor the situation.

  5. resolved Jun 27, 2024, 03:15 AM UTC

    This incident has been resolved. Delayed emails have been delivered with signatures. For more information about the incident with Microsoft 365, please refer to the Microsoft 365 Admin Center (Incident ID: MO805755). All CodeTwo services are operational, emails are delivered without any delays and signatures are added as normal.

  6. postmortem Jul 01, 2024, 03:54 PM UTC

    Below you can find the root cause analysis (RCA) from the Post Incident Report provided by Microsoft regarding this incident (full report available in Microsoft 365 admin center, ID: MO805755). In short, a change in a routing policy led to a configuration issue within Microsoft’s network routing infrastructure, causing impact to multiple Microsoft 365 services in the Asia-Pacific and Australia region. The incorrect routing policy that caused the outage was rolled back on June 27, 2024, at 2:02 AM UTC. All CodeTwo services remained healthy during the incident and all delayed emails have been delivered with signatures.

    **RCA FROM MICROSOFT’S POST INCIDENT REPORT (Microsoft 365 admin center Issue ID: MO805755):**

    _**Scope of Impact**_

    _This issue could have impacted users globally, however, was mostly experienced by users hosted within Australia and Asia-Pacific due to the timeframe of impact overlapping with core business hours in those regions._

    _**Incident Start Date and Time**_

    _Thursday, June 27, 2024, at 1:18 AM UTC_

    _**Incident End Date and Time**_

    _Thursday, June 27, 2024, at 12:30 PM UTC_

    _**Root Cause**_

    _We’ve determined that a recent change caused a configuration issue within our network routing infrastructure, causing impact to multiple Microsoft 365 services._

    _Specifically, in preparation for a planned network upgrade project, a change was made to our automation procedures supporting the upgrade. This change caused the automation to generate an incorrect routing policy that was not captured by our safety test systems in pre-checks for the project. Our WAN consists of two different planes for redundancy. When this incorrect routing policy was applied in the production network, a very large volume of traffic that is usually routed over plane one was routed over to plane two, and then sent back over plane one to reach its destination. This not only induced a very large latency increase, but it also caused congestion on both planes. The incorrect routing policy that caused the outage was rolled back on June 27, 2024, at 2:02 AM UTC._

    _Due to the severity of the incident, several Microsoft 365 services experienced sustained impact and required further intervention to reach full recovery, which was attained on June 27, 2024, at 12:30 PM UTC._

    _Extended impact for Exchange Online_

    _A component used by the Exchange Online frontend proxy service became stuck due to the network conditions, preventing it from recovering once the network conditions were restored. Subsequently, manual recovery interventions were required due to code architecture patterns and the temporary exhaustion of automated recovery actions during the initial impact._
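
The report gives two overlapping time windows: the CodeTwo incident (02:09 to 03:15 UTC) and the longer Microsoft impact window (1:18 AM to 12:30 PM UTC, with the rollback at 2:02 AM UTC). As a minimal sketch for cross-checking the durations, the following uses only the timestamps stated above. The variable names and the `fmt` helper are hypothetical and are not part of any CodeTwo or Microsoft tooling.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Timestamps copied from this report (all June 27, 2024, UTC).
codetwo_start = datetime(2024, 6, 27, 2, 9, tzinfo=UTC)    # CodeTwo incident started
codetwo_end = datetime(2024, 6, 27, 3, 15, tzinfo=UTC)     # CodeTwo incident resolved
ms_start = datetime(2024, 6, 27, 1, 18, tzinfo=UTC)        # Microsoft incident start
ms_rollback = datetime(2024, 6, 27, 2, 2, tzinfo=UTC)      # routing policy rolled back
ms_end = datetime(2024, 6, 27, 12, 30, tzinfo=UTC)         # Microsoft full recovery


def fmt(delta: timedelta) -> str:
    """Format a timedelta as 'Xh Ym' (hypothetical helper for this sketch)."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60}m"


codetwo_duration = fmt(codetwo_end - codetwo_start)
ms_duration = fmt(ms_end - ms_start)
time_to_rollback = fmt(ms_rollback - ms_start)

print("CodeTwo incident duration:", codetwo_duration)
print("Microsoft impact window:", ms_duration)
print("Microsoft start to rollback:", time_to_rollback)
```

The first value matches the 1h 6m duration listed at the top of this page. The other two are derived only from the times quoted in Microsoft's report.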