Auvik incident

Auvik Reporting Sites Down After Maintenance

Auvik posted a notice-level incident on May 12, 2025; no duration was recorded. The incident has been resolved; the full update timeline is below.

Started: May 12, 2025, 02:01 PM UTC
Resolved: May 10, 2025, 02:00 PM UTC
Duration: —
Detected by Pingoru: May 12, 2025, 02:01 PM UTC

Update timeline

resolved May 12, 2025, 02:01 PM UTC

Towards the end of Auvik's scheduled maintenance window on 5/10/2025, Engineering noticed loading issues with sites on several clusters. Investigation determined that there was an issue with the data flow between systems. This interruption required Auvik to extend its maintenance window. Auvik brought each cluster's tenants back up progressively during the process. The work was considered complete at 21:00 EDT. Auvik will furnish an RCA after an internal review has been completed.
postmortem May 16, 2025, 02:26 PM UTC

# Service Disruption - Sites are not available after maintenance

## Root Cause Analysis

### Duration of incident

* Discovered: May 10, 2025 13:04 UTC
* Resolved: May 11, 2025 01:00 UTC

### Cause

A scheduled upgrade of the system failed to complete successfully.

### Effect

Auvik functionality was impacted after the upgrade was implemented. This began a cascade of product functionality failures that required reimplementing the upgraded version using a stepped restart of Auvik.

### Action taken

_All times are in UTC_

**05/10/2025**

* **11:00** Upgrade process begins on core components.
* **12:45** An issue is detected affecting data replication, and some clusters experience connectivity problems.
* **13:05** Engineering begins active investigation into the connectivity issue.
* **13:24** Recovery actions initiated for affected clusters.
* **13:49** Maintenance window extended to address ongoing issues.
* **14:00-14:05** Impacted clusters begin recovering.
* **14:21** Post-upgrade validation reveals a new issue affecting dashboard display in most regions.
* **14:35** Further analysis confirms the issue affects multiple clusters.
* **15:00** Deeper technical investigation begins to isolate the root cause, which is suspected to involve backend services.
* **17:04** Root cause identified as an issue with a core data processing component.
* **17:20** Mitigation strategies explored; decision made to re-attempt the upgrade with a modified approach.
* **18:30-20:17** Second upgrade process begins; similar issues surface in specific regions.
* **21:00-21:25** Recovery actions for affected clusters show positive results; services begin to stabilize.
* **21:30-21:40** Core services successfully rolled out to additional clusters with improved configuration.
* **23:47** One final cluster exhibits recovery issues, addressed through targeted intervention.

**05/11/2025**

* **00:00-01:00** Final recovery actions completed; all services return to normal.
* **01:00** Complete system restoration is confirmed.
### Future consideration(s)

* Implement additional alerting to monitor bandwidth issues on the backend systems more effectively and proactively to prevent bottlenecks.
* Complete the improvements that are already in progress.
* Reduce the load placed on all backend systems simultaneously after a maintenance window.
* Replace several single-point-of-failure configurations with more scalable configurations.
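
The status page leaves the duration blank, but the RCA's Discovered and Resolved timestamps imply one. The following is a minimal Python sketch of that arithmetic, not an official figure from Auvik. It uses only the two timestamps quoted in the RCA and assumes that the "21:00 EDT" completion time in the resolution note refers to May 10, 2025, and that EDT is a fixed UTC-4 offset.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: EDT modeled as a fixed UTC-4 offset.
EDT = timezone(timedelta(hours=-4), "EDT")

# Timestamps as stated in the RCA's "Duration of incident" section.
discovered = datetime(2025, 5, 10, 13, 4, tzinfo=UTC)
resolved = datetime(2025, 5, 11, 1, 0, tzinfo=UTC)

# Elapsed time between discovery and resolution.
duration = resolved - discovered
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60
print(f"Implied duration: {hours} h {minutes} min")  # Implied duration: 11 h 56 min

# Assumption: the "21:00 EDT" completion time falls on May 10, 2025.
completed_edt = datetime(2025, 5, 10, 21, 0, tzinfo=EDT)
print(completed_edt.astimezone(UTC) == resolved)  # True
```

Under these assumptions, the 21:00 EDT completion time in the resolution note and the 01:00 UTC restoration time in the RCA describe the same moment.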