Umbrellar incident
Umbrellar Cloud/Azure Stack Performance Issue
Umbrellar experienced a critical incident on January 9, 2020 affecting NZ North and Management Portal, lasting 2h 5m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Jan 09, 2020, 09:09 PM UTC
We are currently experiencing performance degradation in our Azure Stack environment. Outages and connectivity timeouts have been identified and reported for the services hosted on Azure Stack and for the management portal (https://portal.nznorth.cloud.umbrellar.io/). The Umbrellar Support Engineers are engaged with Microsoft to resolve the issue. We will post further updates every 30 minutes.
- identified Jan 09, 2020, 09:41 PM UTC
The Umbrellar Support Engineers, working with Microsoft, have identified a possible root cause and are working on a resolution.
- identified Jan 09, 2020, 10:11 PM UTC
The Umbrellar Support Engineers are still working with Microsoft on a resolution. We will endeavor to provide additional detail in 30 minutes.
- identified Jan 09, 2020, 10:35 PM UTC
Azure Stack Portal (https://portal.nznorth.cloud.umbrellar.io/) has been restored and is now accessible. The Engineering Team is busy checking all other affected systems to confirm functionality and resolve any issues remaining.
- identified Jan 09, 2020, 10:42 PM UTC
We are continuing to work on a fix for this issue.
- monitoring Jan 09, 2020, 11:10 PM UTC
The Umbrellar Support Team are continuing to monitor all systems and have begun gathering diagnostic information for Root Cause Analysis.
- resolved Jan 09, 2020, 11:14 PM UTC
This incident has been resolved.
- postmortem Feb 17, 2020, 08:29 PM UTC
Second and final update to the **Umbrellar Cloud/Azure Stack Performance Issue**. Microsoft encountered some **networkprobe** warnings. This probe warning usually indicates that one of the machines, either the source or the target, was impaired in some way, for example by maxing out its CPU, memory, or disk. Port leaks can also cause this particular issue. Microsoft also reviewed the performance counters recorded during the incident, and we saw spikes in CPU and memory. We therefore believe that the resource spikes on these machines caused the issues. Restarting the affected VMs resolved the problem. Thank you.
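As a rough illustration of the counter review described above, the sketch below flags samples where CPU or memory usage meets or exceeds a limit. It is not the tooling Microsoft or Umbrellar used. The `Sample` type, the `find_spikes` function, the 90% limits, and the sample values are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One performance-counter reading (hypothetical structure)."""
    label: str      # placeholder label for the sample time
    cpu_pct: float  # CPU utilisation, 0-100
    mem_pct: float  # memory utilisation, 0-100


def find_spikes(samples, cpu_limit=90.0, mem_limit=90.0):
    """Return the samples where CPU or memory meets or exceeds its limit.

    The 90% defaults are assumptions for illustration, not values
    taken from the incident.
    """
    return [
        s for s in samples
        if s.cpu_pct >= cpu_limit or s.mem_pct >= mem_limit
    ]


# Made-up readings, not data from the incident.
samples = [
    Sample("t0", 35.0, 52.0),
    Sample("t1", 97.5, 61.0),
    Sample("t2", 99.0, 94.0),
    Sample("t3", 40.0, 55.0),
]

spikes = find_spikes(samples)
print([s.label for s in spikes])
```

With these made-up readings, only the two middle samples are flagged. A check of this kind could feed an alert before resource pressure reaches the point of failed network probes.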