Auvik incident

Service Disruption - EU1 cluster is experiencing an outage

Critical Resolved

Auvik experienced a critical incident on April 20, 2024 affecting eu1.my.auvik.com, lasting 2h 5m. The incident has been resolved; the full update timeline is below.

Started
Apr 20, 2024, 02:04 PM UTC
Resolved
Apr 20, 2024, 04:10 PM UTC
Duration
2h 5m
Detected by Pingoru
Apr 20, 2024, 02:04 PM UTC

Affected components

eu1.my.auvik.com

Update timeline

  1. investigating Apr 20, 2024, 02:04 PM UTC

    We’re experiencing an outage on the EU1 cluster. Customers will be unable to access their sites at this time. We will continue to provide updates as they become available.

  2. identified Apr 20, 2024, 03:04 PM UTC

    We’ve identified the source of the service disruption to EU1. Sites continue to be down at this time. We are working to apply changes and restore service as quickly as possible.

  3. monitoring Apr 20, 2024, 03:38 PM UTC

    We’ve identified the source of the service disruption and applied a fix. Sites are starting, and we are monitoring to ensure all systems are functional.

  4. resolved Apr 20, 2024, 04:10 PM UTC

    The source of the disruption has been resolved, and services have been fully restored.

  5. postmortem Apr 29, 2024, 04:32 PM UTC

    # Service Disruption - EU1 Customers Experienced an Outage Following the April 20, 2024, Upgrade

    ## Root Cause Analysis

    ### Duration of incident

    Discovered: Apr 20, 2024, 12:36 UTC

    Resolved: Apr 20, 2024, 16:10 UTC

    ### Cause

    A scheduled upgrade was performed on the EU1 cluster to address software requirements for performance and security improvements.

    ### Effect

    Scheduled processes would not run, and clients on the EU1 cluster experienced network connectivity issues.

    ### Action taken

    _All times in UTC_

    **04/20/2024**

    **10:31 -** Planned upgrade takes place during scheduled maintenance.

    **12:36 -** Issues from the upgrade are detected.

    **12:50 -** Initial mitigation to address the issues is taken.

    **13:00 -** Initial mitigation step is deemed insufficient. Investigation into next steps starts.

    **13:34 -** Additional mitigation steps are implemented.

    **14:15 -** Engineering takes the concluding steps to address the disruption by clearing out the failed upgrade.

    **15:10 -** EU1 cluster and clients appear to be recovering.

    **16:00 -** The old data is cleared from the internal pods.

    **16:10 -** The incident is declared resolved.

    ### Future consideration(s)

    * The order of operations list will be reviewed and standardized for upgrades to parts of the Auvik product.
    * Ensure that Subject Matter Expert (SME) approval has been signed off and that an SME is available when pertinent upgrades are scheduled.
    * Enforce the preferred roll-back processes where upgrades to the product are implemented.