Auvik incident
Service Degraded - Internet Connection Checks are creating false alerts on the US3 cluster.
Auvik experienced a minor incident on March 31, 2025 affecting us3.my.auvik.com, lasting 20h 37m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Mar 31, 2025, 06:15 PM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: We will update you as more information becomes available or within the next hour. Thank you for your patience as we work to restore full functionality.
- investigating Mar 31, 2025, 06:54 PM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: We will update you as more information becomes available or by 20:00 UTC. Thank you for your patience as we work to restore full functionality.
- investigating Mar 31, 2025, 08:00 PM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: We will update you as more information becomes available or by 21:00 UTC. Thank you for your patience as we work to restore full functionality.
- investigating Mar 31, 2025, 09:02 PM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: We will update you as more information becomes available or by 22:00 UTC. Thank you for your patience as we work to restore full functionality.
- identified Mar 31, 2025, 09:57 PM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: Auvik will disable alerts for clients on the US3 cluster for up to 1 hour starting at 22:00 UTC. This is a preventative measure as we work through false alerts for the internet connection checks. We apologize for the late notice. Thank you for your patience as we work to restore full functionality.
- identified Mar 31, 2025, 10:19 PM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: Auvik will disable alerts for clients on the US3 cluster for up to 1 hour starting at 22:00 UTC. This is a preventative measure as we work through false alerts for the internet connection checks. Clients may experience a slowed UI response time during this work. This UI slowness should be very short if it is noticeable at all. We apologize for the late notice. Thank you for your patience as we work to restore full functionality.
- identified Mar 31, 2025, 10:59 PM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: Auvik has disabled alerts for clients on the US3 cluster. This action will continue for an additional hour until 23:00 UTC. This is a preventative measure as we work through false alerts for the internet connection checks. Clients may experience a slowed UI response time during this work. Any UI slowness should be very short, if noticeable at all. We apologize for the extended window for this action. Thank you for your patience as we work to restore full functionality.
- identified Mar 31, 2025, 11:48 PM UTC
Affected Services: Internet Connection Service Cluster(s):All Clusters Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: Auvik will perform an emergency cluster restart on US3 tenants at 00:00 UTC, which will take approximately 1.5 hours to complete. At the same time, Auvik will also perform a 20-minute maintenance window to allow for a restart of the Internet connection service for all of Auvik. We sincerely apologize for the extended window for this action. Thank you for your patience as we work to restore full functionality.
- identified Mar 31, 2025, 11:49 PM UTC
We are continuing to work on a fix for this issue.
- identified Apr 01, 2025, 12:22 AM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: The 20-minute maintenance window for the internet connection service for all clusters has been completed. Services, including other monitoring services, are not impacted. Next Steps: The US3 cluster is still going through its restart process. We sincerely apologize for the extended window for this action. Thank you for your patience as we work to restore full functionality.
- monitoring Apr 01, 2025, 01:11 AM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: Our team has implemented a fix for the issue affecting the Internet connection ping check for the tenants on the US3 cluster, and performance is returning to normal. We are currently monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services should be operating normally; however, we continue monitoring for irregularities. Next Steps: Tenants on the US3 cluster are still recovering and look healthy. Thank you for your patience, and we apologize for any inconvenience caused.
- monitoring Apr 01, 2025, 01:25 AM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: Our team has implemented a fix for the issue affecting the Internet connection ping check for the tenants on the US3 cluster, and performance is returning to normal. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services should be operating normally; however, we continue monitoring for irregularities. Next Steps: Tenants on the US3 cluster are still recovering and look healthy. We will continue to monitor the status of the tenants on US3 overnight and report back in the morning. Thank you for your patience, and we apologize for any inconvenience caused.
- monitoring Apr 01, 2025, 01:54 PM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: Our team has implemented a fix for the issue affecting the Internet connection ping check for the tenants on the US3 cluster, and performance is returning to normal. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services are operating normally for most sites. We continue to monitor for irregularities at a few sites, which have been contacted. Next Steps: Tenants on the US3 cluster are still recovering and look healthy. We are attending to a few sites to regain full functionality. Thank you for your patience, and we apologize for any inconvenience caused.
- resolved Apr 01, 2025, 02:52 PM UTC
Affected Services: Internet Connection Service Cluster(s):US3 Description: The issue affecting Internet Connection Ping Checks has been fully resolved. Regular service has been restored, and all systems are operating as expected. Impact: Users should no longer experience any issues related to this incident. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
- postmortem Apr 09, 2025, 05:25 PM UTC
# Service Disruption - Cloud Ping Checks create false alerts on the US3 cluster.

## Root Cause Analysis

### Duration of incident

Discovered: Mar 31, 2025, 13:15 - UTC

Resolved: Apr 01, 2025, 15:52 - UTC

### Cause

The performance of the ping server service on the US3 cluster degraded and produced invalid data.

### Effect

The ping server service sent incorrect data based on the internet connection checks to the alerting service, which created large batches of false alerts sent to customers on the US3 cluster.

### Action taken

_All times are in UTC_

**03/31/2025**

**17:10 -** Ping Server started showing symptoms of degradation.

**17:15 -** Internet Connections are marked offline. Customers experience excessive false alert reports based on the cloud ping check service on the US3 cluster.

**17:20 -** The Auvik engineering team begins its investigation.

**17:20-20:00 -** Auvik continues its investigations and disables the cloud ping service for several large customers on the US3 cluster to prevent excessive alerting once the service is restored.

**20:00 -** Auvik resets the ping server service on the US3 cluster. Ping services fail over to the backup primary ping server service.

**22:25 -** The primary ping server service load rises to a level that begins impacting customers on other clusters.

**04/01/25**

**00:00 -** The US3 cluster is restarted to revert cloud ping checks to the US3 cluster ping server services. Auvik notifies the customers whose cloud ping checks were disabled that the service will remain down until engineering can confirm they can be enabled without causing excessive alerting.

**01:00-01:25 -** The US3 cluster fully restarts successfully. Functionality is restored for most clients on the US3 cluster.

**12:00-15:30 -** Engineering reviews the disabled configurations and disables the responses to the cloud ping check-based alerts.

**15:30-15:52 -** Auvik validates that all cloud ping check services and alerts are enabled for all customers on the US3 cluster. Additional clean-up commences. The incident is concluded.

### Future consideration(s)

* Auvik is building a new cloud ping check server service for the product. This new server service will be rolled out gradually and is expected to be fully deployed into production over the next month.
* Our error handling in the service that processes the cloud ping server data has been improved to identify and ignore invalid data.
* Addresses will no longer be considered offline when invalid data is received.
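The last two considerations describe a general pattern: validate each ping result before acting on it, and keep the previous status when the result is unusable. The following is a minimal sketch of that pattern only, not Auvik's implementation. The `PingSample` record, its `sent` and `received` fields, the `Status` values, and the validity rule are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    ONLINE = "online"
    OFFLINE = "offline"


@dataclass
class PingSample:
    """One hypothetical result record reported by a ping server."""
    address: str
    sent: Optional[int]      # probes sent; None if the field was missing
    received: Optional[int]  # replies received; None if the field was missing


def is_valid(sample: PingSample) -> bool:
    """Assumed rule: counters must be present and internally consistent."""
    if sample.sent is None or sample.received is None:
        return False
    return sample.sent > 0 and 0 <= sample.received <= sample.sent


def next_status(previous: Status, sample: PingSample) -> Status:
    """Return the new status, keeping the previous one for invalid data."""
    if not is_valid(sample):
        # Invalid data is ignored: it never flips an address to OFFLINE.
        return previous
    return Status.ONLINE if sample.received > 0 else Status.OFFLINE
```

With this rule, a record with missing or inconsistent counters leaves an address in its prior state, so a degraded ping server cannot trigger a batch of offline alerts on its own. Only a well-formed result with zero replies marks the address offline.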