Auvik incident

Service Disruption - US4

Major Resolved

Auvik experienced a major incident on February 24, 2025 affecting us4.my.auvik.com, lasting 1h 47m. The incident has been resolved; the full update timeline is below.

Started
Feb 24, 2025, 05:45 PM UTC
Resolved
Feb 24, 2025, 07:32 PM UTC
Duration
1h 47m
Detected by Pingoru
Feb 24, 2025, 05:45 PM UTC

Affected components

us4.my.auvik.com

Update timeline

  1. investigating Feb 24, 2025, 05:45 PM UTC

    Affected Services: Clients on the US4 cluster
    Services not impacted: Clients on other clusters
    Description: We are experiencing degraded performance with tenants on the US4 cluster. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
    Impact: Users will experience issues with connectivity to their tenants.
    Services: Other clusters are not experiencing issues.
    Next Steps: We will provide updates as more information becomes available or by 18:30 UTC.
    Thank you for your patience as we work to restore full functionality.

  2. identified Feb 24, 2025, 05:49 PM UTC

    Affected Services: Clients on the US4 cluster
    Services not impacted: Clients on other clusters
    Description: Our team has identified the root cause of the degraded performance affecting tenants on the US4 cluster and is currently investigating a solution to restore normal service levels.
    Impact: Users will experience issues with connectivity to their tenants.
    Services: Other clusters are not experiencing issues.
    Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or by 18:30 UTC.
    Thank you for your patience as we work to restore full functionality.

  3. identified Feb 24, 2025, 06:29 PM UTC

    Affected Services: Clients on the US4 cluster
    Services not impacted: Clients on other clusters
    Description: Our team has identified the root cause of the degraded performance with tenants on the US4 cluster. We are seeing tenants return to normal service levels.
    Impact: While we work on the resolution, users will start to see their tenants become responsive.
    Services: Other clusters are not impacted.
    Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or by 19:30 UTC.
    Thank you for your patience as we work to restore full functionality.

  4. monitoring Feb 24, 2025, 07:00 PM UTC

    Affected Services: Clients on the US4 cluster
    Services not impacted: Clients on other clusters
    Description: Our team has fixed the issue that made tenants on the US4 cluster inaccessible. The remaining tenants are recovering. We are monitoring the situation to ensure stability and confirm that the service remains fully functional.
    Impact: Service should operate normally; some tenant sites are still becoming accessible.
    Services: Sites on other clusters are not affected.
    Next Steps: We will provide a final update once the issue is resolved.
    Thank you for your patience, and we apologize for any inconvenience caused.

  5. resolved Feb 24, 2025, 07:32 PM UTC

    Affected Services: Tenants on US4
    Services not impacted: Tenants on all other clusters
    Description: The issue that made tenants on the US4 cluster inaccessible has been fully resolved. Regular service has been restored, and all systems are now operating as expected.
    Impact: Users should no longer experience any issues related to this incident.
    Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures.
    Thank you for your patience, and we apologize for any inconvenience caused.

  6. postmortem Mar 11, 2025, 02:00 PM UTC

    # Service Disruption - Clients on the US4 Cluster Unreachable

    ## Root Cause Analysis

    ### Duration of incident

    Discovered: Feb 28, 2025, 16:32 UTC
    Resolved: Feb 28, 2025, 19:30 UTC

    ### Cause

    Overload of backend resources for services on the US4 cluster.

    ### Effect

    Tenants on the US4 cluster became inaccessible.

    ### Action taken

    _All times in UTC_

    **02/28/2025**

    **16:32** - Auvik Engineering discovers several non-responsive backends on the US4 cluster, which causes some tenants to be unresponsive. Engineering begins investigating.

    **17:00** - Attempts are made to revive the non-responsive backends.

    **17:28** - Cluster is in distress, with more backends starting to fail.

    **17:45** - Engineering restarts the entire cluster.

    **18:10-19:30** - The cluster is observed as it restarts and monitored as it comes up to full functionality. The incident is declared resolved.

    ### Future consideration(s)

    * Auvik is currently improving backend monitoring and stability within the product and infrastructure. These improvements aim to help mitigate potential issues proactively in the future.