CodeTwo incident

[North Europe] Degraded performance of signature management apps

CodeTwo experienced a minor incident on September 7, 2022, affecting Signature management (app.codetwo.com) and lasting 6h 7m. The incident has been resolved; the full update timeline is below.

Started: Sep 07, 2022, 11:45 AM UTC
Resolved: Sep 07, 2022, 05:52 PM UTC
Duration: 6h 7m
Detected by Pingoru: Sep 07, 2022, 11:45 AM UTC

Affected components

Signature management (app.codetwo.com)

Update timeline

investigating Sep 07, 2022, 11:45 AM UTC

We are currently investigating this issue.

identified Sep 07, 2022, 11:49 AM UTC

A subset of users in North Europe might currently be unable to manage signatures using app.codetwo.com. We're investigating this problem now. All emails are imprinted and delivered as normal; only designing and creating new signatures is temporarily unavailable.

identified Sep 07, 2022, 12:10 PM UTC

Microsoft confirmed that they are investigating a problem with Cosmos DB in this region. This problem limits your access to app.codetwo.com and affects the performance of Autoresponder, which might result in auto-responses not being sent for some users. We're actively working with Microsoft to make sure they fix the problem ASAP. All signatures are imprinted normally.

identified Sep 07, 2022, 12:28 PM UTC

We are continuing to work on a fix for this issue.

identified Sep 07, 2022, 12:29 PM UTC

Microsoft wrote: "Current Status: We are currently investigating a potential root cause and are exploring mitigation options. We will provide updates in 60 minutes or as events warrant."

identified Sep 07, 2022, 01:59 PM UTC

We are continuing to work on a fix for this issue.

monitoring Sep 07, 2022, 03:58 PM UTC

A fix has been implemented and we are monitoring the results.

monitoring Sep 07, 2022, 05:27 PM UTC

We are continuing to monitor for any further issues.

resolved Sep 07, 2022, 05:52 PM UTC

This incident has been resolved.

postmortem Sep 12, 2022, 01:14 PM UTC

Microsoft posted a preliminary RCA regarding this incident:

_**What happened?**_

_Between 09:50 UTC and 17:21 UTC on 07 Sep 2022, a subset of customers using Azure Cosmos DB in North Europe may have experienced issues accessing services. Connections to Cosmos DB accounts in this region may have resulted in an error or timeout._

_Downstream Azure services that rely on Cosmos DB also experienced impact during this window - including Azure Communication Services, Azure Data Factory, Azure Digital Twins, Azure Event Grid, Azure IoT Hub, Azure Red Hat OpenShift, Azure Remote Rendering, Azure Resource Mover, Azure Rights Management, Azure Spatial Anchors, Azure Synapse, and Microsoft Purview._

_**What went wrong and why?**_

_Cosmos DB load balances workloads across its infrastructure, within frontend and backend clusters. Our frontend load balancing procedure had a regression that did not factor in the effect of a reduction in available cluster capacity, due to ongoing maintenance. This surfaced during an ongoing platform maintenance event in one of the frontend clusters in the North Europe region, causing the availability issues described above._

_**How did we respond?**_

_Our monitors alerted us of the impact on this cluster. We ran two workstreams in parallel – one focused on identifying the reason for the issues themselves, while one focused on mitigating the customer impact. To mitigate, we load balanced off the impacted cluster by moving customer accounts to healthy clusters within the region._

_Given the volume of accounts we had to migrate, it took us time to safely load balance accounts – we had to analyze the state of each account individually, then systematically move each to an alternative healthy cluster in North Europe. This load balancing operation allowed the cluster to recover to a healthy operating state._

_Although we have the ability to mark a Cosmos DB region as offline (which would trigger automatic failover activities, for customers using multiple regions) we decided not to do that during this incident – as the majority of the clusters (and therefore customers) in the region were unimpacted._

_**How are we making incidents like this less likely or less impactful?**_

_Already completed:_

* _Fixed the regression in our load balancer procedure, to safely factor in capacity fluctuations during maintenance._

_In progress:_

* _Improving our monitoring and alerting to detect these issues earlier and apply pre-emptive actions. (Estimated completion: October 2022)_
* _Improving our processes to reduce the impact time with a more structured manual load balancing sequence during incidents. (Estimated completion: November 2022)_
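
The RCA attributes the outage to a load balancing procedure that did not account for capacity taken out of service by maintenance. The sketch below is a purely illustrative model of that class of bug, not Microsoft's implementation: the `Cluster` fields, the `pick_cluster` function, and all numbers are hypothetical. It contrasts a placement decision based on nominal capacity with one based on the capacity that remains once nodes under maintenance are excluded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    # All fields are hypothetical and exist only for this illustration.
    name: str
    total_nodes: int
    nodes_in_maintenance: int
    load_limit_per_node: float  # load units one node can serve
    current_load: float         # load units currently placed on the cluster

    def nominal_capacity(self) -> float:
        # Ignores maintenance: the kind of assumption the RCA describes as a regression.
        return self.total_nodes * self.load_limit_per_node

    def effective_capacity(self) -> float:
        # Counts only nodes that are actually available to serve traffic.
        available = max(self.total_nodes - self.nodes_in_maintenance, 0)
        return available * self.load_limit_per_node


def pick_cluster(clusters: List[Cluster], account_load: float,
                 maintenance_aware: bool = True) -> Optional[Cluster]:
    """Return the cluster with the most headroom that can fit account_load, or None."""
    best: Optional[Cluster] = None
    best_headroom = -1.0
    for cluster in clusters:
        capacity = (cluster.effective_capacity() if maintenance_aware
                    else cluster.nominal_capacity())
        headroom = capacity - cluster.current_load
        if headroom >= account_load and headroom > best_headroom:
            best, best_headroom = cluster, headroom
    return best


if __name__ == "__main__":
    clusters = [
        Cluster("fe-a", total_nodes=10, nodes_in_maintenance=6,
                load_limit_per_node=100.0, current_load=350.0),
        Cluster("fe-b", total_nodes=10, nodes_in_maintenance=0,
                load_limit_per_node=100.0, current_load=500.0),
    ]
    naive = pick_cluster(clusters, 100.0, maintenance_aware=False)
    aware = pick_cluster(clusters, 100.0, maintenance_aware=True)
    print("naive choice:", naive.name if naive else None)
    print("maintenance-aware choice:", aware.name if aware else None)
```

In this toy setup the naive variant keeps sending work to the cluster that is partly under maintenance because its nominal headroom looks largest, while the maintenance-aware variant routes the account to the healthy cluster.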