Auvik incident

Performance Degraded - Information rendering in the Auvik user interface for tenants on the US3 cluster

Severity: Minor · Status: Resolved

Auvik experienced a minor incident on July 3, 2025 affecting us3.my.auvik.com, lasting 1h 30m. The incident has been resolved; the full update timeline is below.

Started
Jul 03, 2025, 05:14 PM UTC
Resolved
Jul 03, 2025, 06:44 PM UTC
Duration
1h 30m
Detected by Pingoru
Jul 03, 2025, 05:14 PM UTC

Affected components

us3.my.auvik.com

Update timeline

  1. investigating Jul 03, 2025, 05:14 PM UTC

    Affected Services: Access to part of the UI
    Services not impacted: Monitoring and alerting functionality
    Description: We are currently experiencing degraded performance with access to the user interface. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.
    Impact: Users may experience issues with their user interface.
    Next Steps: We will update this information as more details become available. Thank you for your patience as we work to restore full functionality.

  2. identified Jul 03, 2025, 05:35 PM UTC

    Affected Services: Access to part of the UI
    Services not impacted: Monitoring and alerting functionality
    Description: We are currently experiencing degraded performance with access to the user interface and are investigating a solution to restore normal service levels.
    Impact: While we work on the resolution, users may continue to experience interruptions in the user interface.
    Next Steps: We will provide updates as the situation progresses. Your patience is greatly appreciated, and we regret any inconvenience you may be experiencing.

  3. monitoring Jul 03, 2025, 06:25 PM UTC

    Affected Services: Access to part of the UI
    Cluster(s): US3
    Description: Our team has implemented a fix for the performance issues affecting the user interface. We are monitoring the system to ensure stability and confirm that performance remains at expected levels.
    Impact: System performance should be restored to normal. We will continue to monitor for any irregularities.
    Next Steps: A final update will be provided once we confirm the issue is resolved. We appreciate your patience as we work through this issue.

  4. resolved Jul 03, 2025, 06:44 PM UTC

    Affected Services: Access to part of the UI
    Cluster(s): US3
    Description: The performance issue affecting access to the UI has been fully resolved, and normal operations have resumed. All systems are functioning as expected.
    Impact: Users should no longer experience any performance-related issues.
    Next Steps: Service has been restored. We apologize for the disruption and appreciate your continued patience. If you continue to experience issues, please contact our support team.

  5. postmortem Jul 29, 2025, 05:50 PM UTC

    # Service Degraded - API and UI interruption on the US3 Cluster

    ## Root Cause Analysis

    ### Duration of incident

    Discovered: Jul 3, 2025, 17:10 UTC
    Resolved: Jul 3, 2025, 18:30 UTC

    ### Cause

    A service component in the US3 region experienced a critical failure that triggered a crash loop, rendering an internal service inoperable.

    ### Effect

    Users experienced disruption to both the user interface and API functions in the US3 environment.

    ### Action taken

    _All times are in UTC_

    **07/03/2025**

    **17:10** Deployment of user interface (UI) change.
    **17:30** Engineering alerted to issues with the UI and APIs on US3.
    **17:40** Attempted rollback of the most recent deployment version.
    **17:50** Crash loop persisted despite rollback, indicating the issue was not caused by a regression.
    **17:52** Introduced more detailed diagnostic logging and restarted the affected services.
    **18:12** Comparison against other environments revealed the same error had occurred without incident elsewhere, suggesting environmental context was a key factor.
    **18:30** Services stabilized and returned to normal.

    ### Future consideration(s)

    * Review environmental differences that contributed to varying behavior between clusters.
    * Consider adding a safe-fail mechanism for deployments to prevent full-service crash loops.
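    The postmortem does not describe how Auvik implements (or would implement) the proposed safe-fail mechanism. As a minimal illustrative sketch only, assuming nothing about Auvik's internal tooling: one common pattern is a circuit breaker that stops restarting a service once it has crashed too many times within a sliding window, so a bad deploy fails fast and pages an operator rather than crash-looping indefinitely. All names and thresholds below are hypothetical.

    ```python
    import time

    class CrashLoopBreaker:
        """Hypothetical safe-fail guard: refuse further restarts once a
        service crashes max_crashes times within window_seconds."""

        def __init__(self, max_crashes=3, window_seconds=300, clock=time.monotonic):
            self.max_crashes = max_crashes
            self.window_seconds = window_seconds
            self.clock = clock          # injectable for testing
            self.crash_times = []

        def record_crash(self):
            """Register a crash; return True if another restart is allowed."""
            now = self.clock()
            self.crash_times.append(now)
            # Keep only crashes inside the sliding window.
            self.crash_times = [t for t in self.crash_times
                                if now - t <= self.window_seconds]
            return len(self.crash_times) < self.max_crashes

        @property
        def tripped(self):
            """True once the crash budget for the window is exhausted."""
            return len(self.crash_times) >= self.max_crashes
    ```

    A supervisor loop would call `record_crash()` each time the service exits abnormally; when it returns False, the supervisor stops restarting, leaves the last-known-good version running (or halts the rollout), and alerts engineering instead.
    
    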