UiPath incident
US - Cloud Robots VM - 3rd party service provider outage
Affected components
- US - Cloud Robots VM
Update timeline
- identified Apr 24, 2026, 04:14 PM UTC
The upstream cloud provider has confirmed an outage impacting VMs for Cloud Robots VM in the US and Delayed US regions.
Impact: Users may be unable to start robots.
Next update: We are working with the provider to understand mitigation timelines.
- identified Apr 24, 2026, 05:05 PM UTC
We are still awaiting further details from the cloud service provider. We are also exploring failover options.
- identified Apr 24, 2026, 05:58 PM UTC
The cloud service provider has identified the issue and started applying a mitigation; we are continuing to follow up with them for more updates. We do not yet have an ETA for when the mitigation will be completed.
- monitoring Apr 24, 2026, 06:10 PM UTC
The cloud service provider has applied mitigations and is starting to see improvements from their end. We are monitoring our services to ensure they are recovering as well.
- monitoring Apr 24, 2026, 06:59 PM UTC
Some VMs have not yet recovered, and the cloud service provider is still actively working on completing their mitigation efforts.
- monitoring Apr 24, 2026, 07:47 PM UTC
The cloud service provider is still actively working on completing their mitigation efforts.
- monitoring Apr 24, 2026, 08:23 PM UTC
We are seeing the remaining VMs recover and will monitor to ensure there is no regression.
- monitoring Apr 24, 2026, 09:17 PM UTC
While we are not seeing any more impact on Cloud Robot VMs, we are continuing to follow the cloud provider's outage until it is fully resolved.
- monitoring Apr 24, 2026, 10:10 PM UTC
We are continuing to follow the cloud provider's outage until it is fully resolved.
- monitoring Apr 24, 2026, 11:12 PM UTC
We are continuing to follow the cloud provider's outage until it is fully resolved.
- resolved Apr 25, 2026, 12:02 AM UTC
The issue has been resolved.
- postmortem Apr 28, 2026, 04:12 PM UTC
## Customer Impact

Between approximately 3:00 pm UTC on April 24, 2026, and 12:04 am UTC on April 25, 2026, a subset of customers in the US Region experienced failures when starting, restarting, or provisioning cloud robots (virtual machines) through Automation Cloud. Impacted customers encountered errors such as a "Partially initialized" or "Failed" status on machines. Existing virtual machines that were already running generally continued to operate, but attempts to provision new machines or restart stopped ones frequently failed. Some customers also experienced delays in job scheduling. The disruption lasted approximately nine hours.

We sincerely apologize for the impact this incident had on your automation workflows. We understand that reliable cloud robot availability is critical to your operations, and we take this disruption seriously.

## Root cause

The incident was caused by a widespread outage affecting the virtual machine service of our underlying cloud infrastructure provider in the US Region. The provider's outage began at 11:39 am UTC on April 24, 2026, several hours before customer-facing impact became apparent, when a recent deployment to their virtual machine platform introduced a fault that disrupted the ability to start, restart, or provision new virtual machines across multiple availability zones. The outage affected multiple infrastructure services beyond virtual machines, including networking, caching, and container orchestration components.

The outage reached our platform in two ways. First, at the time of initial investigation, multiple machines were found with a requested status of "running" but an actual status of "partially initialized," all having transitioned to this failed state within a narrow window around 3:00–3:30 pm UTC. Second, the provider's outage caused connectivity failures within portions of our service infrastructure in the US Region, which surfaced errors to customers interacting with the platform.

## Detection

The incident was first detected at approximately 3:00 pm UTC on April 24, 2026, when the first customer report was received indicating that cloud robots could not be started. Automated monitoring surfaced error patterns, including "partially initialized," shortly thereafter. By 3:52 pm UTC, the incident was formally declared and a response team was assembled. Because the failures were mostly limited to VM lifecycle operations, and not all running workloads, existing automated health checks did not trigger for all affected scenarios.

There was a gap of approximately 15–20 minutes between the first customer report and full scope determination, as the initial impact was sporadic and became clear only after correlating customer reports with infrastructure telemetry. The team identified multiple machines in a failed state across multiple affected organizations. By 4:15 pm UTC, the scope was sufficiently understood, and a status page update was posted to inform customers of the identified issue. A high-priority support case was also opened with the cloud provider at this time.

Notably, the provider's outage had begun at 11:39 am UTC, over three hours before customer-facing impact was detected. The delay between the provider's outage start and observable customer impact is being examined as part of our detection improvement efforts.
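Scoping this kind of incident comes down to finding machines whose requested and actual states disagree. Below is a minimal sketch of such a correlation query; the inventory endpoint, field names, and response shape are illustrative assumptions, not UiPath internals.

```python
# Hypothetical sketch: scope impact by finding machines whose requested state
# and actual state disagree. The endpoint and field names are assumptions.
import requests

INVENTORY_URL = "https://internal.example.com/api/machines"  # assumed endpoint


def find_stuck_machines(region: str = "us") -> list[dict]:
    """Return machines requested as 'running' but stuck 'partially initialized'."""
    resp = requests.get(INVENTORY_URL, params={"region": region}, timeout=10)
    resp.raise_for_status()
    machines = resp.json()
    return [
        m for m in machines
        if m.get("requested_status") == "running"
        and m.get("actual_status") == "partially initialized"
    ]


if __name__ == "__main__":
    stuck = find_stuck_machines()
    orgs = {m["organization_id"] for m in stuck}  # assumed field
    print(f"{len(stuck)} impacted machines across {len(orgs)} organizations")
```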
## Response

Upon detection, our engineering team immediately began investigating the scope and root cause of the failures. By querying internal systems, the team identified that the issue was isolated to the US Region and primarily affected VM start and provisioning operations. The team correlated affected accounts and machines to determine the breadth of impact.

Simultaneously, the team investigated broader service degradation and discovered that portions of our service infrastructure in the US Region were experiencing connectivity failures caused by the provider's outage. This caused some platform calls to time out, contributing to automation job scheduling delays. The team traced these failures to specific infrastructure hosts that had been impaired by the provider's outage.

The following mitigation actions were taken:

* **Service component relocation:** At approximately 6:20 pm UTC, affected service components were relocated from impaired infrastructure hosts to healthy ones. This was performed carefully, one component at a time, to minimize risk. After relocation, the platform's "hypervisor" service once again responded to calls successfully.
* **Cloud provider engagement:** A high-priority support case was opened with the cloud provider, and the team monitored their public status page for updates. The provider confirmed at approximately 5:40 pm UTC that they had begun reverting their faulty deployment. The team also submitted a detailed list of affected VM identifiers to the provider's support case to assist their investigation.
* **VM recovery testing:** The team conducted targeted tests on affected VMs to verify restoration. Some VMs in certain availability zones remained impacted even after initial mitigations, as the provider's recovery progressed zone by zone. By 8:12 pm UTC, previously affected VMs were confirmed operational, even before the provider had updated their own status page.

However, at approximately 10:05 pm UTC, the provider reported a regression in one availability zone and initiated a second corrective action expected to take up to three hours, extending the monitoring period. By April 25, 2026, at 12:04 am UTC, VM operations were consistently succeeding across all availability zones, and the incident was marked as resolved.

Throughout the event, we maintained regular status page updates and communicated directly with impacted customers, including proactive outreach to verify that affected machines had returned to normal operation.
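As an illustration of the recovery-testing step, here is a minimal sketch of a verification loop. The `start_machine` and `get_machine_status` callables are stand-ins for whatever lifecycle API was actually used; both are assumptions.

```python
# Hypothetical sketch: verify recovery by requesting a start and polling until
# the machine reports "running" or a deadline passes. The lifecycle callables
# are illustrative stand-ins, not a real UiPath API.
import time

POLL_INTERVAL_S = 30
DEADLINE_S = 15 * 60  # give each machine up to 15 minutes to come back


def wait_for_recovery(machine_id, start_machine, get_machine_status):
    """Request a start, then poll until the machine reports 'running'."""
    start_machine(machine_id)
    deadline = time.monotonic() + DEADLINE_S
    while time.monotonic() < deadline:
        status = get_machine_status(machine_id)
        if status == "running":
            return True
        if status == "failed":
            start_machine(machine_id)  # retry the start on a hard failure
        time.sleep(POLL_INTERVAL_S)
    return False  # still unhealthy at the deadline; escalate manually
```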
## Follow-up

To reduce the risk and impact of similar incidents in the future, we are implementing several targeted improvements:

1. **Enhanced detection and alerting:** We are expanding our monitoring to include more granular checks on VM lifecycle operations, ensuring that failures in start, restart, or provisioning actions are surfaced immediately, even when running workloads are unaffected (a sketch of such a lifecycle probe appears after this section). This includes adding VM health monitoring capabilities that were not previously in place, correcting alert configurations that referenced incorrect regions during this incident, and exploring earlier detection of upstream provider outages before they manifest as customer-facing impact. We are also investigating ways to reduce the three-hour gap between the provider's outage onset and our initial detection of customer impact.
2. **Automated impact correlation:** We are developing automated tooling to rapidly identify affected accounts and machines based on error states, enabling faster scoping and customer notification. During this incident, impact assessment required manual queries; we are automating this process to significantly reduce response time.
3. **Regional failover readiness:** We are investing in infrastructure changes to support more flexible failover and workload migration for cloud robots, including the ability to provision new VMs in alternate regions when a primary region is impaired. Currently, cloud robot VMs are region-bound, and no backup provisioning path exists in a secondary region. We are addressing this gap to provide greater resilience against single-region provider outages, a recurring pattern we have observed across similar past events.
4. **Customer guidance and communication:** We are updating our customer-facing documentation and in-product messaging to provide clear guidance on steps to take when VM operations fail due to underlying infrastructure outages. We are also improving our status page update cadence and clarity to keep customers better informed during extended incidents.

This incident follows a pattern seen in similar past events, where external platform outages in a single region have disrupted automation services. We are applying lessons learned from those events, including the importance of rapid detection, clear customer communication, and resilient failover strategies, to drive systematic improvements. Our commitment is to continually strengthen our platform's reliability and transparency, so customers can trust Automation Cloud for their most critical workloads.
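For follow-up item 1, a synthetic lifecycle probe might look like the minimal sketch below. The `start`, `stop`, `status`, and `alert` hooks are assumed for illustration and do not represent UiPath's actual tooling.

```python
# Hypothetical sketch of a synthetic VM lifecycle probe: periodically exercise
# a start on a canary machine and alert on failure, even when already-running
# workloads look healthy. All four hooks are assumptions.
import time


def lifecycle_probe(canary_id, start, stop, status, alert, timeout_s=300):
    """One probe cycle: start a canary VM, confirm it runs, then stop it."""
    start(canary_id)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if status(canary_id) == "running":
            stop(canary_id)  # clean up so the next cycle starts cold
            return True
        time.sleep(15)
    alert(f"canary {canary_id} failed to reach 'running' within {timeout_s}s")
    return False
```

Running a probe like this on a schedule catches lifecycle failures that workload-level health checks miss, which was exactly the detection gap in this incident.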
Looking to track UiPath downtime and outages?
Pingoru polls UiPath's status page every 5 minutes and alerts you the moment it reports an issue — before your customers do.
- Real-time alerts when UiPath reports an incident
- Email, Slack, Discord, Microsoft Teams, and webhook notifications
- Track UiPath alongside 5,000+ providers in one dashboard
- Component-level filtering
- Notification groups + maintenance calendar
5 free monitors · No credit card required
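For readers who prefer to roll their own monitoring, here is a minimal sketch of the polling approach described above. It assumes UiPath's status page exposes a Statuspage-style JSON endpoint at status.uipath.com, an assumption worth verifying before relying on it.

```python
# Minimal sketch of status-page polling, assuming a Statuspage-style
# /api/v2/status.json endpoint. The URL is an assumption; verify it.
import time

import requests

STATUS_URL = "https://status.uipath.com/api/v2/status.json"  # assumed endpoint


def poll(interval_s: int = 300) -> None:
    """Poll every `interval_s` seconds and print when the indicator changes."""
    last = None
    while True:
        data = requests.get(STATUS_URL, timeout=10).json()
        # Statuspage-style payloads report e.g. "none", "minor", "major".
        indicator = data.get("status", {}).get("indicator")
        if indicator != last:
            print(f"status changed: {last!r} -> {indicator!r}")
            last = indicator
        time.sleep(interval_s)


if __name__ == "__main__":
    poll()
```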