Auvik incident

Collectors disconnected in us3

Major Resolved

Auvik experienced a major incident on July 25, 2025 affecting us3.my.auvik.com, lasting 1h 49m. The incident has been resolved; the full update timeline is below.

Started
Jul 25, 2025, 07:24 PM UTC
Resolved
Jul 25, 2025, 09:13 PM UTC
Duration
1h 49m
Detected by Pingoru
Jul 25, 2025, 07:24 PM UTC

Affected components

us3.my.auvik.com

Update timeline

  1. investigating Jul 25, 2025, 07:24 PM UTC

    Collectors have been disconnected in us3 since 14:35 ET (18:35 UTC). We are investigating the issue.

  2. monitoring Jul 25, 2025, 08:27 PM UTC

    We have restarted the affected systems, and collectors are beginning to reconnect. We are monitoring the recovery closely.

  3. resolved Jul 25, 2025, 09:13 PM UTC

    The incident has been resolved.

  4. postmortem Jul 30, 2025, 04:27 PM UTC

    # Service Disruption - Collectors offline for clients on the US3 cluster

    ## Root Cause Analysis

    ### Duration of incident

    Discovered: Jul 25, 2025 14:45 UTC
    Resolved: Jul 25, 2025 17:54 UTC

    ### Cause

    Backend nodes were removed from the US3 cluster during a routine cleanup effort intended to improve efficiency. This removal unintentionally included the backend hosting the root tenant, leading to disconnected collectors within our US3 cluster.

    ### Effect

    Collectors in the US3 cluster lost connectivity with customer sites, resulting in disruptions to data collection and monitoring services. This caused temporary gaps in visibility across affected environments.

    ### Action taken

    _All times are in UTC_

    **07/25/2025**

    **14:45** - Engineering notices that collector connections are beginning to fail.
    **18:28** - The team observes tenants not loading.
    **18:35** - Outage reports increase.
    **18:40** - SEV declared, and the root cause investigation begins.
    **18:48** - Backends re-added to balance load.
    **19:00** - Alternate issues ruled out.
    **19:23** - Root tenant backend identified as missing.
    **19:25** - Cluster restart initiated.
    **20:18** - Services begin recovery.
    **21:07** - Incident resolved.

    ### Future consideration(s)

    * Strengthen the backend removal process to confirm the root tenant is excluded.
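The future consideration above amounts to a pre-removal guard: refuse any cleanup batch that includes a backend hosting a protected tenant. The following is a minimal sketch of that idea, not Auvik's actual tooling. The `Backend` type, the `plan_removal` function, the `ProtectedBackendError` exception, and the `"root"` tenant identifier are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


# Hypothetical identifier for the root tenant; the real name is not given in the postmortem.
ROOT_TENANT = "root"


class ProtectedBackendError(Exception):
    """Raised when a removal batch contains a backend hosting a protected tenant."""


@dataclass(frozen=True)
class Backend:
    # Hypothetical model of a backend node and the tenants it hosts.
    name: str
    tenants: frozenset


def plan_removal(candidates, protected_tenants=frozenset({ROOT_TENANT})):
    """Return the names of backends that are safe to remove.

    Rejects the whole batch if any candidate hosts a protected tenant,
    so a partial cleanup cannot proceed unnoticed.
    """
    blocked = [b.name for b in candidates if b.tenants & protected_tenants]
    if blocked:
        raise ProtectedBackendError(
            "refusing removal; protected tenant hosted on: " + ", ".join(blocked)
        )
    return [b.name for b in candidates]
```

Failing the entire batch, instead of silently skipping the protected backend, is a deliberate choice in this sketch: it forces the operator to review the removal list before anything is taken out of the cluster.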