Auvik incident
Service Degraded - Some Clients on the US4 cluster are offline.
Auvik experienced a minor incident on April 14, 2025 affecting us4.my.auvik.com, lasting 4h. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Apr 14, 2025, 10:25 PM UTC
Affected Services: Site availability
Cluster(s): US4
Description: We are currently experiencing degraded performance with sites running on the US4 cluster. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
Impact: Users may experience connectivity issues with their tenants.
Services: None of the other clusters and services are affected.
Next Steps: We will provide updates as more information becomes available or within the next hour.
Thank you for your patience as we work to restore full functionality.
- identified Apr 14, 2025, 10:55 PM UTC
Affected Services: Site availability
Cluster(s): US4
Description: Our team has identified the root cause of the degraded performance affecting client site availability in the US4 cluster. We are currently investigating a solution to restore normal service levels.
Impact: While we work on the resolution, users may experience connectivity issues as sites become available again.
Services: None of the other clusters and services are affected.
Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or by 23:00 UTC.
Thank you for your patience as we work to restore full functionality.
- monitoring Apr 14, 2025, 11:40 PM UTC
Affected Services: Site availability
Cluster(s): US4
Description: Our team has implemented a fix for the issue affecting site connectivity on the US4 cluster. We are waiting for the remaining sites to come back online and are monitoring the situation to ensure stability and confirm that the service remains fully functional.
Impact: Services should be operating normally, except for the remaining sites, which we are waiting on to become fully available.
Services: None of the other clusters and services are affected.
Next Steps: We will provide a final update once all issues are resolved.
Thank you for your patience, and we apologize for any inconvenience caused.
- monitoring Apr 15, 2025, 12:39 AM UTC
Affected Services: Site availability
Cluster(s): US4
Description: Our team has implemented a fix for the issue affecting site connectivity on the US4 cluster. We are waiting for the remaining sites to come back online and are monitoring the situation to ensure stability and confirm that the service remains fully functional.
Impact: Services should operate normally, except for the remaining sites, which we are continuing to work to make fully available.
Services: None of the other clusters and services are affected.
Next Steps: We will provide a final update once all issues are resolved.
Thank you for your patience, and we apologize for any inconvenience caused.
- monitoring Apr 15, 2025, 01:40 AM UTC
Affected Services: Site availability
Cluster(s): US4
Description: Our team has implemented a fix for the issue affecting site connectivity on the US4 cluster. We are waiting for the remaining sites to come back online and are monitoring the situation to ensure stability and confirm that the service remains fully functional.
Impact: Services should operate normally, except for the remaining sites, which we are continuing to work to make fully available.
Services: None of the other clusters and services are affected.
Next Steps: We will provide a final update once all issues are resolved.
Thank you for your patience, and we apologize for any inconvenience caused.
- resolved Apr 15, 2025, 02:25 AM UTC
Affected Services: Site availability
Cluster(s): US4
Description: The issue affecting site availability has been fully resolved. Regular service has been restored, and all systems are now operating as expected.
Impact: Users should no longer experience issues related to this incident, except for select clients we have communicated with.
Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures.
Thank you for your patience, and we apologize for any inconvenience caused.
- postmortem Apr 17, 2025, 03:54 PM UTC
# Service Disruption - Over 50% of clients on the US4 cluster experienced service interruptions.

## Root Cause Analysis

### Duration of incident

Discovered: Apr 14, 2025 19:45 UTC
Resolved: Apr 15, 2025 04:05 UTC

### Cause

A configuration change related to Meraki devices.

### Effect

About 55% of tenants in US4 became inaccessible due to increased traffic and system load.

### Action taken

_All times are in UTC_

**04/14/2025**

**19:45** - Auvik receives internal alerts for abnormal CPU usage on its backend systems for the US4 cluster.

**19:50** - Engineering begins investigating the issue and takes measures to stabilize the system.

**20:42** - A large number of sites become inaccessible, and Auvik implements its incident response.

**20:42-21:45** - Engineering continues to investigate.

**21:45** - A possible root cause of the issue is identified, and Engineering begins recovering sites.

**04/14/2025-04/15/2025**

**21:45-00:10** - Engineering continues to bring most of the affected sites back online.

**04/15/2025**

**00:10** - All sites, except those of one client, are back up and accessible.

**00:10-01:00** - Auvik continues working to bring the last client's tenants back online.

**01:00** - The root cause of the incident is determined. Engineering creates mitigation steps.

**01:00-03:05** - Mitigation steps are implemented, and the remaining sites of the last client are brought back online and made accessible.

### Future consideration(s)

* Auvik has implemented safeguards to prevent a recurrence.