Auvik incident
Service Disruption - Auvik Dashboard in us1, us4 and au1
Auvik experienced a major incident on July 26, 2025 affecting us1.my.auvik.com, us4.my.auvik.com, and the au1 region, lasting 19h 7m. The incident has been resolved; the full update timeline is below.
Affected components
- us1.my.auvik.com
- us4.my.auvik.com
- au1
Update timeline
- investigating Jul 26, 2025, 04:43 PM UTC
We are currently experiencing a service disruption. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may find that sites are not loading properly and may see intermittent disruptions in monitoring. Next Steps: We will update this information as more details become available. We appreciate your patience as we work to restore full functionality.
- investigating Jul 26, 2025, 05:20 PM UTC
We are continuing to investigate. We are rebooting all services in us4. Services are partially available in us1 and au1.
- identified Jul 26, 2025, 07:42 PM UTC
We’ve identified a potential cause of the issue and are actively working on a fix. Some sites are beginning to load again, but intermittent issues may still persist for some users.
- identified Jul 26, 2025, 08:12 PM UTC
us1 is now operational. We are continuing to work through partial issues in au1 and us4, where some sites are still experiencing service disruptions.
- identified Jul 26, 2025, 10:09 PM UTC
au1 is now operational. Recovery efforts remain ongoing in region us4, where a maintenance window was initiated at 09:38 PM UTC to support remediation. We’ll continue to share updates as we make progress.
- monitoring Jul 27, 2025, 03:36 AM UTC
Our team has implemented a fix for the disruption in us4, and the services are returning to normal. We continue to monitor the situation to ensure stability and confirm that the service remains fully functional.
- resolved Jul 27, 2025, 11:51 AM UTC
The incident has been fully resolved. Regular service has been restored, and all systems are operating as expected. Impact: Users should no longer experience any issues related to this service disruption. We thank you for your understanding. If you continue to experience issues, please don't hesitate to contact our support team. We will post an RCA after an internal investigation.
- postmortem Aug 11, 2025, 03:33 PM UTC
# Service Disruption - Hierarchical Data Display Issues

## Root Cause Analysis

### Duration of incident

Discovered: July 26, 2025 14:00 UTC
Resolved: July 31, 2025 12:45 UTC

### Cause

Following a core system upgrade on July 26, 2025, multiple clusters experienced degraded performance due to unexpected issues during service initialization. This led to corruption in certain hierarchical data structures, which in turn impacted various user experiences.

A core infrastructure failure caused an overload in internal system processes, preventing certain backend services from initializing properly. As a result, critical hierarchical user-role data and related settings failed to load or loaded incorrectly across several environments.

### Effect

The disruption impacted customers across multiple regions. Effects included:

* Missing custom settings such as alert preferences and interface configurations.
* Alert notifications being sent to incorrect recipients.
* Inability to access dashboards or view accurate site data.
* Site maps failing to render correctly.
* Login issues for end-users in some environments.
* Inconsistent or missing hierarchical relationships in site selectors.
* Temporary loss of monitoring due to disassociated shared collectors.

These issues collectively degraded service functionality, limited access for internal support teams, and disrupted monitoring workflows.

### Action taken

_All times are in UTC_

**07/26/2025**

* **11:00** – Core upgrade initiated.
* **13:27** – Initial service failures observed. Engineering begins recovery processes.
* **14:19** – Backend query failures reported. Engineering continues to recover and stabilize services.
* **16:24** – Incident response team mobilized.

**07/27/2025** _(Services restored throughout the day)_

* Hierarchy services replayed and clusters rebooted to stabilize services.
* Alerting system functionality restored.

**07/28/2025** _(Services restored throughout the day)_

* Shared monitoring agents reassociated.
* Back-end service migrations initiated for affected tenants.
* Repair scripts run to reset affected data and restore processing pipelines.

**07/30/2025–07/31/2025** _(Services restored throughout the day)_

* Corrupted tenants identified and corrected.
* All affected clusters verified for service integrity.

**07/31/2025**

* **12:45** – Incident considered resolved.

### Future consideration(s)

* Improve validation checks during post-upgrade procedures to avoid cascading service impacts.
* Temporarily pause specific background services (e.g., data cleaners and processors) during upgrades until core services are stable.
* Implement automated detection for corrupted tenant hierarchies or missing role-based configurations.
* Revisit default alert notification behavior to avoid unintended mass-notifications.
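One of the future considerations above is automated detection of corrupted tenant hierarchies. Auvik's internal data model is not public, so the node shape (a tenant id mapping to a parent id) and the check names below are assumptions; this is a minimal sketch of what such a detector might look for, namely dangling parent references and cycles in the parent chain.

```python
def find_hierarchy_corruption(nodes):
    """Return problems found in a parent/child tenant hierarchy.

    `nodes` maps a node id to its parent id (None for roots).
    This shape is hypothetical; any real detector would read from
    the actual hierarchy store.
    """
    problems = {"missing_parents": [], "cycles": []}

    # A parent reference that points at a node that does not exist.
    for node_id, parent_id in nodes.items():
        if parent_id is not None and parent_id not in nodes:
            problems["missing_parents"].append(node_id)

    # Walk upward from every node; revisiting a node on the same
    # walk means the parent chain contains a cycle.
    for node_id in nodes:
        seen = set()
        current = node_id
        while current is not None and current in nodes:
            if current in seen:
                problems["cycles"].append(node_id)
                break
            seen.add(current)
            current = nodes[current]

    return problems


if __name__ == "__main__":
    # Hypothetical tenants: "t3" references a parent that no longer
    # exists, and "t4"/"t5" reference each other.
    tenants = {"t1": None, "t2": "t1", "t3": "ghost", "t4": "t5", "t5": "t4"}
    print(find_hierarchy_corruption(tenants))
```

A periodic job running a check like this after upgrades could flag corrupted tenants for repair scripts (as described under "Action taken") before users notice missing settings or broken site selectors.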