Auvik incident
Service Disruption - US4 cluster is unreachable
Auvik experienced a critical incident on December 13, 2024 affecting us4.my.auvik.com, lasting 1h 4m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Dec 13, 2024, 05:18 PM UTC
Affected Services: US4 Cluster
Description: We are currently experiencing an outage affecting tenants hosted on our US4 cluster. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
Impact: Users will not be able to reach their tenants hosted in US4.
Next Steps: We will provide updates as more information becomes available or within the next hour.
Thank you for your patience as we work to restore full functionality.
- investigating Dec 13, 2024, 05:51 PM UTC
We are continuing to investigate this issue.
- monitoring Dec 13, 2024, 05:59 PM UTC
Affected Services: US4 Cluster
Description: Our team has implemented a fix for the issue affecting the US4 cluster. Tenants are being restored and we are continuing to monitor the recovery progress.
Impact: Any unreachable tenant is queued to be started and will be reachable within approximately 1 hour.
Next Steps: We will provide a final update once we confirm the issue is fully resolved.
Thank you for your patience, and we apologize for any inconvenience caused.
- resolved Dec 13, 2024, 06:23 PM UTC
Affected Services: US4 Cluster
Description: The issue affecting US4 has been addressed and the system has recovered.
Impact: Users should now be able to access their tenants on US4.
Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures.
Thank you for your patience, and we apologize for any inconvenience caused.
- postmortem Jan 10, 2025, 03:29 PM UTC
# Service Disruption - Cluster US4 is unreachable for customers

## Root Cause Analysis

### Duration of incident

Discovered: Dec 13, 2024 17:03 UTC

Resolved: Dec 13, 2024 18:23 UTC

### Cause

Routine tasks performed in preparation for the upcoming weekend's maintenance caused an unexpected load on the system.

### Effect

The backend systems overwhelmed the systems on the US4 cluster, which interrupted communication with the tenants.

### Action taken

_All times in UTC_

**12/13/2024**

**16:57 -** Steps to prepare the system for the next day's maintenance are performed.

**17:03 -** Tenants on the US4 cluster become unreachable.

**17:09 -** The Auvik engineering team assembles stakeholders to investigate the service interruption.

**17:25 -** The backend systems on the US4 cluster begin to recover independently.

**17:39 -** Tenants begin to become reachable internally.

**17:40 -** Tenants become visible in the UI.

**17:57 -** Engineering addresses tenants that are not coming back up gracefully.

**18:23 -** Tenants on US4 have recovered.

### Future consideration(s)

* Auvik has altered its preparation for scheduled maintenance, eliminating processes that could affect system performance in the future.
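
As a rough illustration of the timeline arithmetic, the Python sketch below computes the minutes elapsed at each step relative to the 17:03 UTC outage start. The timestamps are copied from the postmortem timeline above; the function names `minutes_between` and `elapsed_since_outage` and the shortened event labels are hypothetical and are not part of any Auvik tooling.

```python
from datetime import datetime

# Timestamps copied from the postmortem timeline (all UTC, Dec 13, 2024).
# The short labels are paraphrases added for this sketch.
TIMELINE = [
    ("16:57", "maintenance preparation steps performed"),
    ("17:03", "tenants become unreachable"),
    ("17:09", "engineering assembles stakeholders"),
    ("17:25", "backend systems begin to recover"),
    ("17:39", "tenants reachable internally"),
    ("17:40", "tenants visible in the UI"),
    ("17:57", "engineering addresses remaining tenants"),
    ("18:23", "tenants recovered"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end (same day, HH:MM strings)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_outage(timeline, outage_start="17:03"):
    """Pair each event with its offset in minutes from the outage start."""
    return [(t, minutes_between(outage_start, t), label) for t, label in timeline]


if __name__ == "__main__":
    for t, mins, label in elapsed_since_outage(TIMELINE):
        print(f"{t}  {mins:+4d} min  {label}")
```

With these inputs, `minutes_between("17:03", "18:23")` returns 80, which is 1h 20m from discovery to recovery.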