Auvik incident
Service Degraded - Discovery Consolidation on US6 Cluster
Auvik experienced a minor incident on February 3, 2025 affecting us6.my.auvik.com, lasting 2h 32m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Feb 03, 2025, 07:23 PM UTC
Affected Services: Discovery Consolidation Cluster(s): US6 Description: We are currently experiencing degraded performance with the consolidation of devices. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience issues with new device discovery and consolidation. Services: Alerting is not impacted. Next Steps: We will provide updates as more information becomes available or by 19:30 UTC. Thank you for your patience as we work to restore full functionality.
- investigating Feb 03, 2025, 07:42 PM UTC
Affected Services: Discovery Consolidation Cluster(s): US6 Description: We are currently experiencing degraded performance with the consolidation of devices. Our team is still actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience issues with new device discovery and consolidation. Services: Alerting is not impacted. Next Steps: We will provide updates as more information becomes available or by 20:30 UTC. Thank you for your patience as we work to restore full functionality.
- identified Feb 03, 2025, 08:31 PM UTC
Affected Services: Discovery Consolidation Cluster(s): US6 Description: Our team has identified the root cause of the degraded performance affecting new device discovery and consolidation. We are currently working on a solution to restore normal service levels. Impact: While we work on the resolution, users will continue to experience issues with device discovery and consolidation. Services: Alerting is not impacted. Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or within the next hour. Thank you for your patience as we work to restore full functionality.
- monitoring Feb 03, 2025, 08:42 PM UTC
Affected Services: Discovery Consolidation Cluster(s): US6 Description: Our team has implemented a fix for the issue affecting the consolidation of devices, and device consolidation performance is returning to normal. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Service is returning to normal; however, we continue monitoring for irregularities. Services: Alerting was not impacted. Next Steps: We will provide a final update once we confirm the issue is fully resolved. Thank you for your patience, and we apologize for any inconvenience caused.
- resolved Feb 03, 2025, 09:56 PM UTC
Affected Services: Discovery Consolidation Cluster(s): US6 Description: The issue affecting Discovery Consolidation has been fully resolved. Normal service has been restored, and all systems are now operating as expected. Impact: Users should no longer experience any issues related to this incident. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
- postmortem Feb 10, 2025, 03:27 PM UTC
# **Service Degraded - Newly discovered devices and consolidation are not working for clients on the US6 cluster.**

## **Root Cause Analysis**

### Duration of incident

Discovered: Feb 03, 2025 17:00 UTC

Resolved: Feb 03, 2025 21:30 UTC

### Cause

A reorganization of engineering caused a permission change for tenant migrations.

### Effect

This change caused permission issues with a tenant migration to another cluster, which, in turn, caused problems with consolidation on the same cluster.

### Action taken

_All times in UTC_

**02/03/2025**

**16:58 –** A tenant is migrated off of the US6 cluster.

**18:00 –** Engineering is aware of consolidation issues for clients on the US6 cluster and begins investigating.

**20:28 –** The initial cause of the interruption is determined. Engineering disables the migration service.

**20:34 –** The tenant migration that caused the issues is identified.

**20:47 –** The root cause of the interruption of services is identified.

**21:15 –** The underlying issues that caused the service interruption are fixed.

**21:45 –** Tenant migrations are re-enabled in the consolidation service.

**22:01 –** The problematic tenant is successfully migrated.

**22:39 –** All services are confirmed to be running as intended.

### Future consideration(s)

* Auvik is reviewing the permission changes that have occurred and running tests to validate the blast radius of those changes.
* Any changes will be fully commented and documented so that they are easier to follow.
* Auvik will set up a regular test migration to validate tenant migration functionality.