Auvik incident

Performance Degraded - Information rendering in the Auvik user interface for tenants on the US3 cluster

Severity: Minor · Status: Resolved

Auvik experienced a minor incident on July 3, 2025 affecting us3.my.auvik.com, lasting 1h 30m. The incident has been resolved; the full update timeline is below.

Started
Jul 03, 2025, 05:14 PM UTC
Resolved
Jul 03, 2025, 06:44 PM UTC
Duration
1h 30m
Detected by Pingoru
Jul 03, 2025, 05:14 PM UTC

Affected components

us3.my.auvik.com

Update timeline

  1. investigating Jul 03, 2025, 05:14 PM UTC

    Affected Services: Access to part of the UI
    Services not impacted: Monitoring and alerting functionality
    Description: We are currently experiencing degraded performance with access to the user interface. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
    Impact: Users may experience issues with their user interface.
    Next Steps: We will update this information as more details become available. Thank you for your patience as we work to restore full functionality.

  2. identified Jul 03, 2025, 05:35 PM UTC

    Affected Services: Access to part of the UI
    Services not impacted: Monitoring and alerting functionality
    Description: We are currently experiencing degraded performance with access to the user interface and are investigating a solution to restore normal service levels.
    Impact: While we work on the resolution, users may continue to experience interruptions in the user interface.
    Next Steps: We will provide updates as the situation progresses. Your patience is greatly appreciated, and we regret any inconvenience you may be experiencing.

  3. monitoring Jul 03, 2025, 06:25 PM UTC

    Affected Services: Access to part of the UI
    Cluster(s): US3
    Description: Our team has implemented a fix for the performance issues affecting the user interface. We are monitoring the system to ensure stability and confirm that performance remains at expected levels.
    Impact: System performance should be restored to normal. We will continue to monitor for any irregularities.
    Next Steps: A final update will be provided once we confirm the issue is resolved. We appreciate your patience as we work through this issue.

  4. resolved Jul 03, 2025, 06:44 PM UTC

    Affected Services: Access to part of the UI
    Cluster(s): US3
    Description: The performance issue affecting access to the UI has been fully resolved, and normal operations have resumed. All systems are functioning as expected.
    Impact: Users should no longer experience any performance-related issues.
    Next Steps: Service has been restored. We apologize for the disruption and appreciate your continued patience. If you continue to experience issues, please contact our support team.

  5. postmortem Jul 29, 2025, 05:50 PM UTC

    # Service Degraded - API and UI interruption on the US3 Cluster

    ## Root Cause Analysis

    ### Duration of incident

    Discovered: Jul 3, 2025, 17:10 UTC
    Resolved: Jul 3, 2025, 18:30 UTC

    ### Cause

    A service component in the US3 region experienced a critical failure that triggered a crash loop, rendering an internal service inoperable.

    ### Effect

    Users experienced disruption to both the user interface and API functions in the US3 environment.

    ### Action taken

    _All times are in UTC_

    **07/03/2025**

    **17:10** Deployment of user interface (UI) change.
    **17:30** Engineering alerted to issues with the UI and APIs on US3.
    **17:40** Attempted rollback of the most recent deployment version.
    **17:50** Crash loop persisted despite rollback, indicating the issue was not caused by a regression.
    **17:52** Introduced more detailed diagnostic logging and restarted the affected services.
    **18:12** Comparison against other environments revealed the same error had occurred without incident elsewhere, suggesting environmental context was a key factor.
    **18:30** Services stabilized and returned to normal.

    ### Future consideration(s)

    * Review environmental differences that contributed to varying behavior between clusters.
    * Consider adding a safe-fail mechanism for deployments to prevent full-service crash loops.
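    The postmortem does not describe how Auvik implements (or would implement) the proposed safe-fail mechanism. As a minimal illustrative sketch only, assuming nothing about Auvik's internal tooling: one common pattern is a circuit breaker that stops restarting a service once it has crashed too many times within a sliding window, so a bad deploy fails fast and pages an operator rather than crash-looping indefinitely. All names and thresholds below are hypothetical.

    ```python
    import time

    class CrashLoopBreaker:
        """Hypothetical safe-fail guard: refuse further restarts once a
        service crashes max_crashes times within window_seconds."""

        def __init__(self, max_crashes=3, window_seconds=300, clock=time.monotonic):
            self.max_crashes = max_crashes
            self.window_seconds = window_seconds
            self.clock = clock          # injectable for testing
            self.crash_times = []

        def record_crash(self):
            """Register a crash; return True if another restart is allowed."""
            now = self.clock()
            self.crash_times.append(now)
            # Keep only crashes inside the sliding window.
            self.crash_times = [t for t in self.crash_times
                                if now - t <= self.window_seconds]
            return len(self.crash_times) < self.max_crashes

        @property
        def tripped(self):
            """True once the crash budget for the window is exhausted."""
            return len(self.crash_times) >= self.max_crashes
    ```

    A supervisor loop would call `record_crash()` each time the service exits abnormally; when it returns False, the supervisor stops restarting, leaves the last-known-good version running (or halts the rollout), and alerts engineering instead.
    
    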