Umbrellar incident

Umbrellar Cloud/Azure Stack Performance Issue

Umbrellar experienced a critical incident on January 9, 2020 affecting NZ North and Management Portal, lasting 2h 5m. The incident has been resolved; the full update timeline is below.

Started: Jan 09, 2020, 09:09 PM UTC
Resolved: Jan 09, 2020, 11:14 PM UTC
Duration: 2h 5m
Detected by Pingoru: Jan 09, 2020, 09:09 PM UTC

Affected components

NZ North, Management Portal

Update timeline

investigating Jan 09, 2020, 09:09 PM UTC

We are currently experiencing performance degradation with our Azure Stack environment. Outages and connectivity time-outs affecting services hosted on Azure Stack, as well as the management portal (https://portal.nznorth.cloud.umbrellar.io/), have been identified and reported. The Umbrellar Support Engineers are engaged with Microsoft to resolve the issue. We will post further updates every 30 minutes.
identified Jan 09, 2020, 09:41 PM UTC

The Umbrellar Support Engineers, working with Microsoft, have identified a possible root cause and are working on a resolution.
identified Jan 09, 2020, 10:11 PM UTC

The Umbrellar Support Engineers are still working with Microsoft on a resolution. We will endeavor to provide additional detail in 30 minutes.
identified Jan 09, 2020, 10:35 PM UTC

Azure Stack Portal (https://portal.nznorth.cloud.umbrellar.io/) has been restored and is now accessible. The Engineering Team is busy checking all other affected systems to confirm functionality and resolve any issues remaining.
identified Jan 09, 2020, 10:42 PM UTC

We are continuing to work on a fix for this issue.
monitoring Jan 09, 2020, 11:10 PM UTC

The Umbrellar Support Team are continuing to monitor all systems and have begun gathering diagnostic information for Root Cause Analysis.
resolved Jan 09, 2020, 11:14 PM UTC

This incident has been resolved.
postmortem Feb 17, 2020, 08:29 PM UTC

Second and final update to the **Umbrellar Cloud/Azure Stack Performance Issue**.

Microsoft encountered some **networkprobe** warnings. This warning usually indicates that one of the machines involved, either the source or the target, was unhealthy in some way, for example by maxing out its CPU, memory, or disk. Port leaks can also cause this warning. Microsoft also reviewed the performance counters recorded during the incident, and we saw spikes in CPU and memory. We believe these resource spikes on the affected machines caused the issues. Restarting these VMs resolved the problem.

Thank you