Umbrellar incident

Umbrellar Cloud/Azure Stack Performance Issue

Critical Resolved

Umbrellar experienced a critical incident on January 9, 2020 affecting NZ North and Management Portal, lasting 2h 5m. The incident has been resolved; the full update timeline is below.

Started
Jan 09, 2020, 09:09 PM UTC
Resolved
Jan 09, 2020, 11:14 PM UTC
Duration
2h 5m
Detected by Pingoru
Jan 09, 2020, 09:09 PM UTC

Affected components

NZ North, Management Portal

Update timeline

  1. investigating Jan 09, 2020, 09:09 PM UTC

    We are currently experiencing performance degradation in our Azure Stack environment. Outages and connection time-outs affecting the services hosted on Azure Stack, as well as the management portal (https://portal.nznorth.cloud.umbrellar.io/), have been identified and reported. The Umbrellar Support Engineers are engaged with Microsoft to resolve the issue. We will post further updates every 30 minutes.

  2. identified Jan 09, 2020, 09:41 PM UTC

    The Umbrellar Support Engineers, together with Microsoft, have identified a possible root cause and are working on a resolution.

  3. identified Jan 09, 2020, 10:11 PM UTC

    The Umbrellar Support Engineers are still working with Microsoft on a resolution. We will endeavor to provide additional detail in 30 minutes.

  4. identified Jan 09, 2020, 10:35 PM UTC

    Azure Stack Portal (https://portal.nznorth.cloud.umbrellar.io/) has been restored and is now accessible. The Engineering Team is busy checking all other affected systems to confirm functionality and resolve any issues remaining.

  5. identified Jan 09, 2020, 10:42 PM UTC

    We are continuing to work on a fix for this issue.

  6. monitoring Jan 09, 2020, 11:10 PM UTC

    The Umbrellar Support Team are continuing to monitor all systems and have begun gathering diagnostic information for Root Cause Analysis.

  7. resolved Jan 09, 2020, 11:14 PM UTC

    This incident has been resolved.

  8. postmortem Feb 17, 2020, 08:29 PM UTC

    Second and final update to the **Umbrellar Cloud/Azure Stack Performance Issue**

    Microsoft (MS) encountered some **networkprobe** warnings. This probe warning usually indicates that one of the machines, either the source or the target, was broken in some way, for example by maxing out its CPU, memory, or disk. Port leaks can also cause this particular issue. MS also looked at performance counters from the time of the issue, and we saw spikes in CPU and memory. We therefore think the spiking machines caused the issues. Restarting these VMs solved the problem.

    Thank you