Auvik incident

Service Disruption - Clients on the US4 cluster are receiving 500 errors when trying to access their sites

Auvik experienced a major incident on October 14, 2024 affecting us4.my.auvik.com, lasting 2h 17m. The incident has been resolved; the full update timeline is below.

Started: Oct 14, 2024, 11:46 AM UTC
Resolved: Oct 14, 2024, 02:03 PM UTC
Duration: 2h 17m
Detected by Pingoru: Oct 14, 2024, 11:46 AM UTC

Affected components

us4.my.auvik.com

Update timeline

investigating Oct 14, 2024, 11:46 AM UTC

We’re experiencing disruption with client sites on the US4 Cluster. When they try to access their sites, they receive 500 errors. We will continue to provide updates as they become available.
identified Oct 14, 2024, 12:06 PM UTC

We’re experiencing disruption with client sites on the US4 cluster. When they try to access their sites, they receive 500 errors. Auvik requires an emergency restart of the US4 cluster. This action will take half an hour. Tenants on the US4 cluster are expected to start recovering after the restart. All sites are expected to be fully functional within 1.5 hours of the restart.
monitoring Oct 14, 2024, 12:40 PM UTC

We’ve identified the source of the service disruption with sites on the US4 cluster. We have performed an emergency cluster restart and are monitoring the situation. Sites on the cluster are recovering, and we anticipate all sites will be up and running by 14:00 UTC. We’ll keep you posted on a resolution.
resolved Oct 14, 2024, 02:03 PM UTC

The disruption that caused client sites on the US4 cluster to receive 500 errors when accessing their sites has been resolved. Services have been restored. A few large client sites are still verifying and should resolve shortly. A Root Cause Analysis (RCA) will follow after a full review.
postmortem Nov 01, 2024, 01:50 PM UTC

# Service Disruption

## Backend Resource Strain and Service Disruption over a multi-day period

### Root Cause Analysis

### Duration of incident

Discovered: Oct 07, 2024 09:56 UTC
Resolved: Oct 07, 2024 19:00 UTC

Discovered: Oct 14, 2024 10:55 UTC
Resolved: Oct 14, 2024 14:00 UTC

Discovered: Oct 16, 2024 05:42 UTC
Resolved: Oct 17, 2024 13:37 UTC

### **Cause**

The primary cause of this multi-day incident was a combination of backend instability and resource management challenges triggered by technical bugs and configuration issues. Specifically, a non-thread-safe map in the Autotask integration led to excessive CPU consumption, compounded by frequent tenant migrations and high memory usage across multiple clusters. Excessive API requests through the Web Application Firewall (WAF) and misconfigurations further strained backend resources, resulting in widespread service disruptions and extended recovery time.

### **Effect**

The incident significantly impacted service availability and performance across multiple clusters. Users experienced frequent 500 and 504 errors, delays in accessing tenant data, and slow UI loading times. The high CPU usage and backend instability led to tenant migrations and disrupted connectivity, causing certain features to become intermittently unavailable. Additionally, the ongoing backend strain increased support cases and required multiple restarts and resource reallocations, prolonging the disruption and leading to a degraded experience for affected users over several days.

### **Action taken**

_All times in UTC_

**10/07/2024**

**Initial Detection and Escalation**

**09:56 - 10:02** Key symptoms identified:

* High heap usage across multiple backends.
* Communication failures between nodes in clusters CA1 and US1, causing tenant access issues.
* Multiple tenants stuck in a verifying state.

**10:20 - 11:30** Escalated mitigations:

* Decided to restart CA1, followed by US1, to address node communication issues.
* The status page is updated to notify users of ongoing disruptions.

**12:19 - 13:06** Status recap and monitoring of ongoing issues, including:

* Continued high heap usage.
* Tenant availability errors (504s) due to lost seed nodes.
* Investigation of tenant verification issues.

**14:00 - 19:00** Investigation of model instability and backend performance issues continues, with some partial fixes applied.

**19:00** Temporary workaround applied to stabilize model flapping.

**10/14/2024**

**Continued Investigation and Remediation**

**11:30** Focused mitigation for US4 clients to stabilize tenant access and service performance.

**14:00** Affected sites and tenants restarted, resolving some availability issues.

**10/16/2024**

**Addressing WAF and High CPU Issues**

**17:15** WAF mitigation steps taken, blocking excessive requests from specific IPs.

**18:31** WAF issues confirmed resolved after blocking IPs responsible for high traffic.

**10/16/2024**

**High CPU Issues and Tenant Rebalancing**

**10:55 - 11:25** High CPU usage detected on multiple backends. Affected backends are capped, restarted, and drained to mitigate load.

**12:12 - 12:29** Specific problematic tenants were identified as triggering frequent backend moves and further resource strain.

**15:00 - 18:00** Troubleshooting and tenant isolation continue; problematic tenants are isolated, and partial recovery is achieved.

**10/17/2024**

**Root Cause Fixes and Final Resolution**

**10:35** Further diagnosis identifies the root cause: the non-thread-safe map, which led to high CPU usage.

**13:27** A short-term fix was applied to stabilize the problematic tenant and manage resource allocation.

**13:37** Confirmed complete restoration of affected tenants and systems.

### **Future consideration(s)**

* Auvik has installed a fix for the model identification instability.
* Auvik has implemented a fix for tenants that become stuck in a verifying state because they cannot locate their tenant manager.
* Auvik has implemented a fix to prevent the identified third-party integration from locking CPU processes, which caused the backend to fail due to high resource consumption.
* Auvik has installed a fix to prevent long device names from causing continual tenant failures across backends.
* Auvik has added enhanced monitoring for excessive backend tenant failures.
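
The root cause named in this postmortem, a map shared between threads without synchronization, is a general class of bug. The postmortem does not say which language or map type was involved, so the following is only a minimal Python sketch of the usual remedy: serializing every read-modify-write on the shared map behind a lock. The names `LockedCounterMap`, `run_workers`, and the tenant keys are hypothetical and are not taken from Auvik's code.

```python
import threading
from collections import defaultdict


class LockedCounterMap:
    """Hypothetical shared map whose updates are serialized by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def increment(self, key, amount=1):
        # Holding the lock makes the lookup, addition, and store one atomic step.
        with self._lock:
            self._counts[key] += amount

    def snapshot(self):
        # Return a copy so callers never iterate the live map while writers mutate it.
        with self._lock:
            return dict(self._counts)


def run_workers(shared, keys, workers=8, updates_per_worker=10_000):
    """Update the shared map from several threads and return the final snapshot."""

    def work():
        for i in range(updates_per_worker):
            shared.increment(keys[i % len(keys)])

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.snapshot()


if __name__ == "__main__":
    # Illustrative keys only; every update should be accounted for in the totals.
    print(run_workers(LockedCounterMap(), ["tenant-a", "tenant-b"]))
```

Because all access goes through the lock, no update is lost regardless of how the threads interleave. On other runtimes the equivalent choice is typically a concurrent map type from the standard library instead of a hand-rolled lock.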