Auvik incident

Service Disruption - Clients on the US3 cluster are receiving 500 errors when trying to access their sites

Auvik experienced a minor incident on October 16, 2024 affecting us3.my.auvik.com, lasting 58m. The incident has been resolved; the full update timeline is below.

Started: Oct 16, 2024, 09:59 AM UTC
Resolved: Oct 16, 2024, 10:58 AM UTC
Duration: 58m
Detected by Pingoru: Oct 16, 2024, 09:59 AM UTC

Affected components

us3.my.auvik.com

Update timeline

investigating Oct 16, 2024, 09:59 AM UTC

We’re experiencing disruption with client sites on the US3 Cluster. When they try to access their sites, they receive 500 errors. We will continue to provide updates as they become available.

identified Oct 16, 2024, 10:38 AM UTC

We’ve identified the source of the service disruption with client sites on the US3 Cluster. When they try to access their sites, they receive 500 errors. We are working to restore service as quickly as possible.

monitoring Oct 16, 2024, 10:51 AM UTC

We’ve identified the source of the service disruption with client sites on the US3 Cluster. When they try to access their sites, they receive 500 errors. We are implementing the fix and will keep you posted on a resolution.

resolved Oct 16, 2024, 10:58 AM UTC

The fix has been implemented for sites with 500 errors and inaccessible sites. The source of the disruption has been resolved, and services have been fully restored.

postmortem Nov 01, 2024, 01:50 PM UTC

# Service Disruption

## Backend Resource Strain and Service Disruption over a Multi-Day Period

### Root Cause Analysis

### Duration of incident

Discovered: Oct 07, 2024 09:56 UTC
Resolved: Oct 07, 2024 19:00 UTC

Discovered: Oct 14, 2024 10:55 UTC
Resolved: Oct 14, 2024 14:00 UTC

Discovered: Oct 16, 2024 05:42 UTC
Resolved: Oct 17, 2024 13:37 UTC

### **Cause**

The primary cause of this multi-day incident was a combination of backend instability and resource management challenges triggered by technical bugs and configuration issues. Specifically, a non-thread-safe map in the Autotask integration led to excessive CPU consumption, compounded by frequent tenant migrations and high memory usage across multiple clusters. Excessive API requests through the Web Application Firewall (WAF) and misconfigurations further strained backend resources, resulting in widespread service disruptions and extended recovery time.

### **Effect**

The incident significantly impacted service availability and performance across multiple clusters. Users experienced frequent 500 and 504 errors, delays in accessing tenant data, and slow UI loading times. The high CPU usage and backend instability led to tenant migrations and disrupted connectivity, causing certain features to become intermittently unavailable. Additionally, the ongoing backend strain increased support cases and required multiple restarts and resource reallocations, prolonging the disruption and leading to a degraded experience for affected users over several days.

### **Action taken**

_All times in UTC_

**10/07/2024** **Initial Detection and Escalation**

**09:56 - 10:02** Key symptoms identified:

* High heap usage across multiple backends.
* Communication failures between nodes in clusters CA1 and US1, causing tenant access issues.
* Multiple tenants stuck in a verifying state.

**10:20 - 11:30** Escalated mitigations: Decided to restart CA1, followed by US1, to address node communication issues. The status page was updated to notify users of ongoing disruptions.

**12:19 - 13:06** Status recap and monitoring of ongoing issues, including:

* Continued high heap usage.
* Tenant availability errors (504s) due to lost seed nodes.
* Investigation of tenant verification issues.

**14:00 - 19:00** Work continued on investigating model instability and backend performance issues, with some partial fixes applied.

**19:00** Temporary workaround applied to stabilize model flapping.

**10/14/2024** **Continued Investigation and Remediation**

**11:30** Focused mitigation for US4 clients to stabilize tenant access and service performance.

**14:00** Affected sites and tenants restarted, resolving some availability issues.

**10/16/2024** **Addressing WAF and High CPU Issues**

**17:15** WAF mitigation steps taken, blocking excessive requests from specific IPs.

**18:31** WAF issues confirmed resolved after blocking IPs responsible for high traffic.

**10/16/2024** **High CPU Issues and Tenant Rebalancing**

**10:55 - 11:25** High CPU usage detected on multiple backends. Affected backends were capped, restarted, and drained to mitigate load.

**12:12 - 12:29** Specific problematic tenants were identified as triggering frequent backend moves and further resource strain.

**15:00 - 18:00** Troubleshooting and tenant isolation continued; problematic tenants were isolated, and partial recovery was achieved.

**10/17/2024** **Root Cause Fixes and Final Resolution**

**10:35** Further diagnosis identified the root cause: the non-thread-safe map, which led to high CPU usage.

**13:27** A short-term fix was applied to stabilize the problematic tenant and manage resource allocation.

**13:37** Confirmed complete restoration of affected tenants and systems.

### **Future consideration(s)**

* Auvik has installed a repair for the model identification instability.
* Auvik has implemented a repair to address tenants stuck in a verifying state that cannot locate their tenant manager.
* Auvik has implemented a fix to prevent the identified third-party integration from locking CPU processes, which would cause the backend to fail due to high resource consumption.
* Auvik has installed a fix to prevent long device names from causing continual tenant failures across backends.
* Auvik has added enhanced monitoring for excessive backend tenant failures.
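
The postmortem names a non-thread-safe map as the root cause but does not say what language the Autotask integration is written in or how the map was used. As a generic illustration only, and not Auvik's actual fix, the sketch below shows one common remedy: serializing every read-modify-write on a shared map behind a lock. It is written in Python for brevity; `LockedCounterMap`, `run_workers`, and the tenant keys are hypothetical names invented for this example. On some runtimes, concurrent mutation of an unsynchronized hash map can corrupt its internal structure and leave threads spinning, while in Python the more typical symptom of an unguarded update is lost writes.

```python
import threading


class LockedCounterMap:
    """A map of counters whose updates are serialized by a single lock.

    Hypothetical illustration; not taken from the incident's codebase.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def increment(self, key, amount=1):
        # The read (get) and the write (assignment) happen under the same
        # lock, so no other thread can interleave between them.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + amount
            return self._counts[key]

    def snapshot(self):
        # Return a copy so callers never iterate the live dict while
        # another thread is mutating it.
        with self._lock:
            return dict(self._counts)


def run_workers(counter_map, keys, workers=8, updates_per_worker=1000):
    """Hammer the map from several threads and return the final counts."""

    def work():
        for i in range(updates_per_worker):
            counter_map.increment(keys[i % len(keys)])

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter_map.snapshot()


if __name__ == "__main__":
    totals = run_workers(LockedCounterMap(), ["tenant-a", "tenant-b"])
    print(totals)
```

A single coarse lock is the simplest option and is easy to reason about. Under heavy contention, a purpose-built concurrent map or per-key locking may scale better, at the cost of more complexity.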