Auvik incident

Clients on US4 cluster are experiencing 500 Errors

Minor Resolved

Auvik experienced a minor incident on October 15, 2025 affecting us4.my.auvik.com, lasting 2h 11m. The incident has been resolved; the full update timeline is below.

Started
Oct 15, 2025, 03:22 PM UTC
Resolved
Oct 15, 2025, 05:33 PM UTC
Duration
2h 11m
Detected by Pingoru
Oct 15, 2025, 03:22 PM UTC

Affected components

us4.my.auvik.com

Update timeline

  1. investigating Oct 15, 2025, 03:22 PM UTC

    We are currently investigating reports of 500 errors affecting access to clients on the US4 cluster. Impact: Customers may experience access issues to their tenants. Alerts and monitoring are not affected. The other clusters are not affected. Next Steps: Our team is working to identify contributing factors. Updates will follow as more information becomes available.

  2. investigating Oct 15, 2025, 04:39 PM UTC

    We are continuing to investigate HTTP 500 errors affecting access to clients on the US4 cluster. Impact: Customers will experience access issues to their tenants. Alerts and monitoring are not affected. The other clusters are not affected. Next Steps: Our team is working to identify contributing factors. Updates will follow as more information becomes available.

  3. identified Oct 15, 2025, 05:04 PM UTC

    Our team has identified a suspected cause of the HTTP 500 site access issue and is taking steps to remediate it. Impact: Customers should now be able to access their sites without encountering an HTTP 500 error. Maps and site inventories are not rendering correctly in the UI. The following services are not affected: Monitoring and Alerting. Please report any related issues to Auvik Support so we can track and assist further. Next Steps: We are applying mitigation measures and will provide updates on progress.

  4. monitoring Oct 15, 2025, 05:22 PM UTC

    We have applied changes to address the issues. Services appear to be operating normally, and we are monitoring closely for stability. Impact: Services should be operating normally; however, if you continue to encounter problems, please report them to Auvik Support. Next Steps: A final update will be posted once we confirm the resolution.

  5. resolved Oct 15, 2025, 05:33 PM UTC

    The incident has been fully resolved, and all services are operating normally. Customers should no longer experience any related issues. If you continue to experience problems, please don't hesitate to contact Auvik Support. We will provide a Root Cause Analysis (RCA) once it is available.

  6. postmortem Oct 31, 2025, 02:27 PM UTC

    # Service Degraded - Sites have lost settings after maintenance

    ## Root Cause Analysis

    ### Duration of the incident

    Discovered: Oct 11, 2025 – 12:00 UTC
    Resolved: Oct 15, 2025 – 22:00 UTC

    ### Cause

    During a scheduled system update, a background maintenance process unintentionally removed reference files used to identify stored site configurations. When affected systems restarted after the update, they were unable to locate those configuration files and temporarily appeared as new, empty sites. This occurred because the maintenance process was using outdated information when determining which data to clean up safely. The underlying data remained securely stored, but the missing reference files prevented normal access until they were restored.

    ### Effect

    A subset of tenants across multiple clusters temporarily lost access to their site configurations and appeared as newly created environments. Customers observed missing data and configurations, including previously defined network settings and device details.

    ### Action taken

    _All times are in UTC_

    **10/11/2025**

    **12:00** – A scheduled system upgrade began across all clusters.

    **15:25** – The support team received reports from customers that some sites appeared empty or missing data.

    **16:00** – Engineering immediately began investigating and determined this was not related to normal data processing delays.

    **10/12/2025**

    Additional reports confirmed that several sites were missing configuration information. The engineering team confirmed that the original data was still securely stored, but was not being correctly loaded by the system.

    **10/13/2025**

    The issue was traced to missing metadata files that help identify stored configurations. The engineering team began restoring affected sites using the most recent valid configuration data. Automated recovery tools were developed to safely restore additional sites and ensure consistent recovery across all clusters.

    **10/14/2025**

    Engineering verified that configuration data was fully restored and synchronized across supporting services. The recovery process was extended to all remaining sites, with validation steps confirming successful restoration.

    **10/15/2025**

    Final recovery efforts for the remaining affected clusters were completed.

    **22:00** – All affected sites were confirmed operational with their configurations restored and verified.

    ### Future consideration(s)

    * Remove dependency of cleanup processes on outdated cluster data sources.
    * Validate all automated cleanup jobs to ensure they do not operate on production clusters.
    * Implement monitoring for missing or corrupted metadata files before deployments.
    * Enhance post-deployment validation to verify the integrity of configuration data across all clusters.
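
One of the considerations listed in the postmortem is monitoring for missing metadata files before deployments. The sketch below illustrates what such a check might look like. It is not Auvik's tooling: the on-disk layout (one `.cfg` configuration file with a sibling `.meta` reference file per site), the suffixes, and the function names `find_missing_metadata` and `predeploy_check` are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical layout: each stored site configuration "<site>.cfg" is expected
# to have a sibling reference file "<site>.meta". Both suffixes are assumptions.
CONFIG_SUFFIX = ".cfg"
META_SUFFIX = ".meta"


def find_missing_metadata(data_root: Path) -> list[Path]:
    """Return stored configuration files that have no matching reference file."""
    missing = []
    for config in sorted(data_root.rglob(f"*{CONFIG_SUFFIX}")):
        # with_suffix swaps ".cfg" for ".meta" on the same path.
        if not config.with_suffix(META_SUFFIX).is_file():
            missing.append(config)
    return missing


def predeploy_check(data_root: Path) -> bool:
    """Report orphaned configurations; return True only if none were found."""
    missing = find_missing_metadata(data_root)
    for path in missing:
        print(f"missing reference file for {path.name}")
    return not missing
```

A deployment pipeline could call `predeploy_check` on the storage root and abort when it returns `False`, so that a configuration without its reference file is caught before systems restart rather than after.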