Auvik incident

Service Degraded - Cloud Ping check not working for some tenants

Minor | Resolved

Auvik experienced a minor incident on February 19, 2025 affecting us3.my.auvik.com, lasting 3h 13m. The incident has been resolved; the full update timeline is below.

Started
Feb 19, 2025, 05:14 PM UTC
Resolved
Feb 19, 2025, 08:27 PM UTC
Duration
3h 13m
Detected by Pingoru
Feb 19, 2025, 05:14 PM UTC

Affected components

us3.my.auvik.com

Update timeline

  1. investigating Feb 19, 2025, 05:14 PM UTC

    Affected Services: Cloud Ping Check
    Cluster(s): US3
    Description: We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
    Impact: Users may experience excessive false alerts.
    Services: All other monitoring, alerting, maps and integrations are not impacted.
    Next Steps: We will provide updates as more information becomes available or within the next hour.
    Thank you for your patience as we work to restore full functionality.

  2. investigating Feb 19, 2025, 05:28 PM UTC

    Affected Services: Cloud Ping Check
    Cluster(s): US3
    Description: We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
    Impact: Users may experience excessive false alerts.
    Services: All other monitoring, alerting, maps and integrations are not impacted.
    Recommendation: Auvik recommends you disable your Cloud Ping Check and any customized Cloud Ping Check alerts until the problem is resolved.
    Next Steps: We will provide updates as more information becomes available or by 18:00 UTC.
    Thank you for your patience as we work to restore full functionality.

  3. identified Feb 19, 2025, 06:00 PM UTC

    Affected Services: Cloud Ping Check
    Cluster(s): All Clusters
    Description: We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
    Impact: Users may experience excessive false alerts. Resources failing over from US3 may affect alerting in other clusters.
    Services: All other monitoring, alerting, maps and integrations are not impacted.
    Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or by 19:00 UTC.
    Thank you for your patience as we work to restore full functionality.

  4. identified Feb 19, 2025, 06:44 PM UTC

    Affected Services: All alerts
    Cluster(s): All Clusters
    Auvik is posting an emergency maintenance window to disable alerts starting at 19:00 UTC. Alerts are scheduled to be re-enabled by 20:00 UTC.
    Thank you for your patience as we work to restore full functionality.

  5. identified Feb 19, 2025, 07:21 PM UTC

    Affected Services: All alerts
    Cluster(s): All Clusters
    The alerting maintenance window has ended. Alerts will now flow as intended.
    Thank you for your patience as we work to restore full functionality.

  6. monitoring Feb 19, 2025, 07:28 PM UTC

    Affected Services: Cloud Ping Check
    Cluster(s): All Clusters
    Description: We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
    Impact: Service should operate normally; however, we continue monitoring for any irregularities.
    Services: All other monitoring, alerting, maps and integrations are not impacted.
    Next Steps: We will provide a final update once we confirm the issue is fully resolved.
    Thank you for your patience, and we apologize for any inconvenience caused.

  7. resolved Feb 19, 2025, 08:27 PM UTC

    Affected Services: Cloud Ping Check
    Cluster(s): All Clusters
    Description: We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
    Impact: Users should no longer experience any issues related to this incident.
    Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures.
    Thank you for your patience, and we apologize for any inconvenience caused.

  8. postmortem Mar 11, 2025, 01:52 PM UTC

    # Service Degraded - Cloud Ping Services Check Failing Intermittently on the US3 Cluster

    ## Root Cause Analysis

    ### Duration of incident

    Discovered: Feb 19, 2025 14:18 UTC
    Resolved: Mar 01, 2025 15:00 UTC

    ### Cause

    The Cloud Ping service became unstable due to a large number of clients running ping checks at a 5-second interval, leading to widespread ping check failures.

    ### Effect

    Clients received excessive Cloud Ping check alerts corresponding to failed pings.

    ### Action taken

    _All times in UTC_

    **02/13/2025-02/19/2025**

    Auvik started receiving complaints about an unusually high number of internet connection failures. A general investigation begins with customers reporting these issues.

    **02/19/2025**

    **14:18** - Auvik Engineering ascertains that the US3 cluster has several clients with a high number of internet connection checks set to the 5-second setting. An internal investigation then begins.

    **17:42** - Auvik disables Cloud Ping alerts in the US3 cluster for those affected.

    **17:53-18:44** - Auvik Engineering decides to restart the ping service to help clear the lag and re-stabilize it. A maintenance window is required to perform this action.

    **19:00** - A one-hour maintenance window is started.

    **19:21** - The work required under the maintenance window concludes early, and the services are back up and running. Cloud Ping alerts are restored for all clients.

    **02/24/2025**

    It’s noted that while the ping service is behaving normally for most clients, there continue to be intermittent problems. It is determined that a complete cluster restart is required. To minimize the impact on all customers, a decision is made to do maintenance on 03/01/2025.

    **03/01/2025**

    **12:00-15:00** - Auvik undergoes maintenance, during which US3 is safely restarted to restore the health of all services.

    ### Future consideration(s)

    * Auvik has worked with several clients who have set up a 5-second ping check to regulate the flow and prevent system overload.
    * Auvik will remove clients' ability to perform a 5-second cloud ping check and default the check frequency to one minute. The timing of this change will follow in future Auvik release notes.
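
To put the planned interval change in perspective, the minimal sketch below compares how many pings a single check generates per hour at the 5-second setting and at the one-minute default. The function names and constants are hypothetical and are not part of any Auvik API; only the two interval values come from the postmortem, and actual load on the Cloud Ping service depends on factors the report does not describe.

```python
# Illustrative arithmetic only: names below are hypothetical, not Auvik's API.
LEGACY_INTERVAL_SECONDS = 5     # the setting the postmortem says will be removed
DEFAULT_INTERVAL_SECONDS = 60   # the one-minute default named in the postmortem


def checks_per_hour(interval_seconds: int, check_count: int = 1) -> int:
    """Return how many pings `check_count` checks issue in one hour."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return check_count * (3600 // interval_seconds)


def load_ratio(old_interval_seconds: int, new_interval_seconds: int) -> float:
    """Return how many times more pings the old interval issues than the new one."""
    if old_interval_seconds <= 0 or new_interval_seconds <= 0:
        raise ValueError("intervals must be positive")
    return new_interval_seconds / old_interval_seconds


if __name__ == "__main__":
    print(checks_per_hour(LEGACY_INTERVAL_SECONDS))    # pings/hour at 5 seconds
    print(checks_per_hour(DEFAULT_INTERVAL_SECONDS))   # pings/hour at one minute
    print(load_ratio(LEGACY_INTERVAL_SECONDS, DEFAULT_INTERVAL_SECONDS))
```

Under these assumptions, a check at the 5-second setting issues twelve times as many pings as the same check at one minute, which is consistent with the postmortem's choice to retire the shorter interval.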