CodeTwo incident

[Global] Users are unable to log in to Admin Panel

CodeTwo experienced a minor incident on May 5, 2024 affecting Mail flow and Mail flow and 1 more component, lasting 1h. The incident has been resolved; the full update timeline is below.

Started: May 05, 2024, 02:32 PM UTC
Resolved: May 05, 2024, 03:33 PM UTC
Duration: 1h
Detected by Pingoru: May 05, 2024, 02:32 PM UTC

Affected components

Mail flowMail flowMail flowMail flowMail flowMail flowMail flowMail flowMail flowSignature adding (rules processing)

Update timeline

investigating May 05, 2024, 02:32 PM UTC

We are investigating an issue with CodeTwo Admin Panel logon issues for some users. It looks like Microsoft Azure Active Directory B2C services are down, which means users are unable to log in to CodeTwo Admin Panel. Logging in to CodeTwo Manage Signatures App is available, so if you want to manage signature or autoresponder rules, just log in to CodeTwo from this website: https://app.codetwo.com. All other services are operational. Signatures are added as normal.
investigating May 05, 2024, 03:12 PM UTC

Microsoft is aware of the problem on their end and is currently investigating it. If you want to manage signature or autoresponder rules, just log in to CodeTwo from this website: https://app.codetwo.com. Apart from Admin Panel (https://login.codetwo.com), all other services are operational. Signatures are added as normal.
monitoring May 05, 2024, 03:17 PM UTC

A fix has been implemented and we are monitoring the results.
resolved May 05, 2024, 03:33 PM UTC

This incident has been resolved.
postmortem May 13, 2024, 04:49 AM UTC

Below you will find the Post Incident Review \(PIR/RCA\), provided by Microsoft, regarding this incident. In short, Microsoft claims they failed to properly scale out their services and a high surge of traffic in a couple of regions lead to their servers run out of resources that caused Azure AD B2C availability issues and ultimately overloaded the entire service. The Azure AD B2C service is used by CodeTwo to handle some of our sign in options for our Customers, therefore our Customers were unable to sign in to CodeTwo Admin Panel during this incident. **MICROSOFT’S POST INCIDENT REVIEW \(Azure Portal ID: QTRM-RPZ\):** _**What happened?**_ _A platform issue caused some of our Azure Active Directory B2C customers in Europe to experience elevated failure rates for end users accessing B2C applications between 13:40 UTC on 5 May 2024 and 00:27 UTC on 6 May 2024. This issue impacted approximately 70% of Azure AD B2C customers in Europe, including one or more of your Azure subscriptions._ _**What went wrong and why?**_ _The incident was caused by an unusually sharp spike in requests to Azure Active Directory B2C service in North Europe and West Europe regions. We’re choosing not to characterize the high surge of traffic at this time. While the service was scaled out to handle the increase in load, this increase in traffic was much higher than the service could handle. As a result of the traffic spike, server machines started running out of resources to handle requests leading to an increase in failed requests._ _**How did we respond?**_ _At 13:44 UTC on 5 May 2024, Azure AD B2C monitoring detected a decrease in availability, and elevated failure rates for end users in North Europe and West Europe regions, so we began an investigation._ _While efforts to understand and remediate the issue began immediately, a combination of elevated traffic, conservative timeout limits, and excessive retries caused the system to become overloaded - leading to resource exhaustion and request failures. Due to the nature of the spike, the failures were happening in both the North Europe and West Europe regions where Azure Active Directory B2C is deployed within the Europe geography - and our mitigation option of failing over to other regions was not available to us._ _Since the failures were happening across multiple regions, we investigated if there was an incident in the services on which we are dependent. Once we confirmed there was not, we identified the cause of the resource exhaustion as a sudden spike in incoming traffic. Once this was understood, we deployed configuration changes to manage the pace of incoming traffic better, and to remove problematic requests more accurately. As a result of these configuration changes, we saw the resource utilization come down and the service start to recover. This configuration change led to partial recovery. Our team deployed an additional set of configuration changes to further reduce resource utilization._ _Once this next set of configuration changes was deployed, service recovered to full health as request queues on our infrastructure were cleared. All customer impact was confirmed as fully mitigated by 00:27 UTC on 6 May 2024._ _**How are we making incidents like this less likely or less impactful?**_ * _Our Azure AD B2C team has rolled out configuration changes to throttle the types of traffic patterns that caused this incident, and to improve timeouts and retry logic, in order to preserve service health for other customers. \(Completed\)_ * _Our Azure AD B2C team is improving automation that identifies customers with elevated end user error rates, so we can notify all potentially impacted customers of issues like this more quickly in future. \(Estimated completion: May 2024\)_ * _Our Azure AD B2C team is working on systematically reviewing failure modes throughout our system, and implementing changes to improve efficiency of managing resources under load. \(Estimated completion: June 2024\)_ * _Our Azure AD B2C team is working on a plan to improve the mitigation process and reduce the time taken to mitigate similar issues in the future. \(Estimated completion: June 2024\)_