Auvik experienced a minor incident on April 25, 2024, affecting my.auvik.com, us1.my.auvik.com, and 1 more component, lasting 7h 6m. The incident has been resolved; the full update timeline is below.
Update timeline
- identified Apr 25, 2024, 06:30 PM UTC
We’ve identified the source of the performance issue in the discovery of new devices. We are working to restore optimal service as quickly as possible.
- monitoring Apr 25, 2024, 06:49 PM UTC
We’ve identified the source of the performance issue with delays in new device discovery and are monitoring the situation. We've implemented the fix and are waiting for device information to catch up in the system. As the lag catches up, we expect to be back to optimal performance in a few hours. We’ll keep you posted on a resolution.
- resolved Apr 26, 2024, 01:36 AM UTC
The delay for device discovery has been resolved. The source of the performance impact has been addressed, and performance should again be optimal. A Root Cause Analysis (RCA) will follow after completing a full review.
- postmortem May 08, 2024, 01:39 PM UTC
# Performance Disruption - Delays with New Device Discovery

## Root Cause Analysis

### Duration of incident

Discovered: Apr 25, 2024 14:00 UTC

Resolved: Apr 26, 2024 01:30 UTC

### Cause

Changes were placed into production to address findings from the Auvik March 15, 2024, incident. The changes were not behind a feature flag to prevent them from affecting production data.

### Effect

The changes were not granted proper permissions, which caused a data crash loop. This delayed newly discovered devices.

### Action taken

_All times in UTC_

**04/24/2024**

**14:00-17:30 -** Updated code merged into production code to address the bug discovered in the Auvik March 15, 2024, incident.

**04/25/2024**

**14:35 -** An approved tenant migration causes a crash loop of data for newly discovered devices.

**18:04 -** The Auvik engineering team responsible for the implemented change is made aware of the crash loop and delay in rendering new devices in the product.

**18:17 -** Engineering determines the cause of the crash loop and adjusts permissions for the implemented changes.

**18:23 -** The changes implemented for permissions have the desired effect, and consumer lag begins to improve. Data will be delayed as the lag catches up to the live production data.

**04/26/2024**

**01:30 -** Consumer lag fully recovers, and all data is current. The incident is closed.

### Future consideration(s)

* Changes have been implemented to adjust service account permissions for improvements to code automatically.
* An internal review was performed on the review of code changes and approval processes for production.
* Adjustments to internal alerting are reviewed to highlight the prioritization of production-impacted changes.