Arista CloudVision incident

Elevated connections and requests failure

Arista CloudVision experienced a critical incident on September 8, 2025 affecting multiple Core Platform components, lasting 2h 38m. The incident has been resolved; the full update timeline is below.

Started: Sep 08, 2025, 12:56 PM UTC
Resolved: Sep 08, 2025, 03:35 PM UTC
Duration: 2h 38m
Detected by Pingoru: Sep 08, 2025, 12:56 PM UTC

Affected components

Core Platform (listed 9 times)

Update timeline

investigating Sep 08, 2025, 12:56 PM UTC

We are currently investigating alerts about connection failures affecting all service regions.
identified Sep 08, 2025, 01:18 PM UTC

We have identified the root cause of the issue. The team is working on a fix.
identified Sep 08, 2025, 02:01 PM UTC

A fix is rolling out on all service regions.
monitoring Sep 08, 2025, 02:15 PM UTC

Services are coming back up, but with degraded performance. We are monitoring the situation.
monitoring Sep 08, 2025, 02:22 PM UTC

uk-1, india-1 and na-northeast1-b regions are fully operational.
monitoring Sep 08, 2025, 02:26 PM UTC

euwest-2 region is fully operational.
monitoring Sep 08, 2025, 02:42 PM UTC

ausoutheast-1 and apnortheast-1 regions are fully operational.
monitoring Sep 08, 2025, 02:56 PM UTC

us-central1-b region is fully operational.
monitoring Sep 08, 2025, 03:07 PM UTC

us-central1-a region is fully operational.
monitoring Sep 08, 2025, 03:31 PM UTC

us-central1-c region is fully operational.
resolved Sep 08, 2025, 03:35 PM UTC

This incident is now fully resolved. A postmortem is still pending; our team is actively reviewing the incident and will post a postmortem update.
postmortem Sep 16, 2025, 05:54 PM UTC

**Incident Report: CVaaS Service Disruption - September 8, 2025**

On September 8, 2025, CloudVision as-a-Service (CVaaS) experienced a service disruption lasting from 8 minutes in our UK region (cv-prod-uk-1) to 25 minutes in a US region (us-central1-a). During this time, users were unable to establish new connections, and users with existing connections saw transient connection errors. We do not expect any data loss, as device state streaming was coalesced and resumed after the disruption. This incident was not caused by a security breach, intrusion, or other malicious activity.

**Root Cause**

An internal certificate in our trust chain expired, causing the disruption. The certificate was not properly monitored by the system that sends expiry notifications and automatically renews certificates, so it was allowed to expire.

**Our Response**

Our team was alerted to similar service disruptions across various global regions, and a status page incident was posted because the issue was not transient and was unrelated to our service provider. The disruption was triaged, the root cause identified, and mitigations implemented within 20 minutes. Corrective actions were taken simultaneously over approximately the next 35 minutes. The incident was closed once services were restored and an assessment of all regions was completed.

We understand that an outage of this nature is unacceptable and apologize to all CloudVision as-a-Service users affected by this incident. We are committed to implementing the changes identified during our internal investigation to prevent this type of failure from happening again.

Thank you for using CloudVision as-a-Service.
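
The root cause above, an unmonitored certificate expiring inside a trust chain, can be illustrated with a minimal Python sketch of an expiry audit that covers every certificate in a chain, not only the leaf. This is not Arista's tooling. The function names, the 30-day warning window, and the certificate names and dates are all hypothetical, chosen only for illustration. The only library call relied on is the standard library's `ssl.cert_time_to_seconds`, which parses the `notAfter` string format returned by `SSLSocket.getpeercert()`.

```python
import ssl
from datetime import datetime, timedelta, timezone


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate 'notAfter' string into an aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def expiry_status(not_after: str, now: datetime, warn_days: int = 30) -> str:
    """Classify one certificate as 'expired', 'expiring', or 'ok'.

    warn_days is a hypothetical alerting window, not a value from the report.
    """
    remaining = parse_not_after(not_after) - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(days=warn_days):
        return "expiring"
    return "ok"


def audit_chain(chain: dict[str, str], now: datetime, warn_days: int = 30) -> dict[str, str]:
    """Check every certificate in the chain, including internal intermediates."""
    return {name: expiry_status(not_after, now, warn_days) for name, not_after in chain.items()}


if __name__ == "__main__":
    # Hypothetical chain: names and dates are made up for this example.
    example_chain = {
        "leaf": "Jun  1 00:00:00 2025 GMT",
        "internal-intermediate": "Jan 15 00:00:00 2025 GMT",
        "internal-root": "Dec 31 23:59:59 2024 GMT",
    }
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    for name, status in audit_chain(example_chain, now).items():
        print(f"{name}: {status}")
```

Auditing the chain as a whole, with each certificate's `notAfter` value fed in from an inventory, is meant to show that a certificate missing from the inventory is never checked at all, which is the kind of gap the report describes.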