Okta experienced a minor incident on August 1, 2025 affecting multiple okta.com cells (US Cells 1, 2, 3, 4, 6, 7, 11, and 14). Active service impact lasted roughly 7.5 hours (454 minutes); the 10d 5h span below runs from the first status update to the publication of the root cause analysis. The incident has been resolved; the full update timeline is below.
Affected components
okta.com cells 1, 2, 3, 4, 6, 7, 11, and 14 (per the updates below)
Update timeline
- investigating Aug 01, 2025, 02:29 PM UTC
At 8/1/2025 7:29 AM PT, the Workflows Automation and Extensibility team became aware of an issue with timeouts to our service affecting customers in US Cell 1, US Cell 2, US Cell 3, US Cell 4, US Cell 6, US Cell 7, and US Cell 11. During this time, users may be experiencing intermittent issues within our system. Our team is actively investigating and working to mitigate the issue. We will provide another update within the next 30 minutes, or sooner if additional information becomes available.
- identified Aug 01, 2025, 03:49 PM UTC
The Workflows Automation and Extensibility team is investigating and has determined that the incident is currently impacting Workflows in US Cell 1, US Cell 2, US Cell 3, US Cell 4, US Cell 6, US Cell 7, and US Cell 11. During this time, workflows may be delayed or time out. The Workflows Automation and Extensibility team has applied some changes to mitigate the issue and is monitoring for improvements. We'll provide an update in 30 minutes, or sooner, as additional information becomes available.
- identified Aug 01, 2025, 04:14 PM UTC
The Workflows Automation and Extensibility team is investigating and has determined that the incident is currently impacting Workflows in US Cell 1, US Cell 2, US Cell 3, US Cell 4, US Cell 6, US Cell 7, and US Cell 11. During this time, real-time flows are delayed and may time out. Non-real-time flows are now running with normal latency. The Workflows Automation and Extensibility team is applying further mitigations and monitoring for improvements. We'll provide an update in 30 minutes, or sooner, as additional information becomes available.
- identified Aug 01, 2025, 04:53 PM UTC
The Workflows Automation and Extensibility team is investigating and has determined that the incident is currently impacting Workflows in US Cell 1, US Cell 2, US Cell 3, US Cell 4, US Cell 6, US Cell 7, and US Cell 11. During this time, real-time flows are delayed and may time out. Non-real-time flows are now running with normal latency. The Workflows Automation and Extensibility team is continuing to apply mitigations and is monitoring for improvements. We'll provide an update in 1 hour, or sooner, as additional information becomes available.
- monitoring Aug 01, 2025, 06:01 PM UTC
The Workflows Automation and Extensibility team is still monitoring the fix applied to address the incident impacting Workflows in US Cell 1, US Cell 2, US Cell 3, US Cell 4, US Cell 6, US Cell 7, and US Cell 11. Mitigation is progressing, and customers should be seeing fewer delays or timeouts. We'll provide an update in 60 minutes, or sooner, as additional information becomes available.
- monitoring Aug 01, 2025, 06:57 PM UTC
The Workflows Automation and Extensibility team continues to monitor the fix applied to address the incident impacting Workflows in US Cell 1, US Cell 2, US Cell 3, US Cell 4, US Cell 6, US Cell 7, and US Cell 11. Workflow services should be processing normally at this time. We'll provide an update in 60 minutes, or sooner, as additional information becomes available.
- monitoring Aug 01, 2025, 08:09 PM UTC
The Workflows Automation and Extensibility team continues to observe steady recovery in the Workflow services across US Cell 1, US Cell 2, US Cell 3, US Cell 4, US Cell 6, US Cell 7, and US Cell 11. We'll provide an update in 60 minutes, or sooner, as additional information becomes available.
- monitoring Aug 01, 2025, 09:21 PM UTC
The team is continuing to monitor the fix implemented to resolve the recent Workflows incident. While we're seeing clear indications of recovery, we're proceeding with an abundance of caution to ensure full stability. Customer workflows should be seeing improvements, and we're working diligently to ensure everything flows smoothly. We'll share our next update in 120 minutes.
- monitoring Aug 01, 2025, 11:22 PM UTC
Service has fully recovered from the initial impact on OK14, and we're transitioning from active response to a dedicated monitoring phase on the other cells. Our engineers are continuously tracking system performance to ensure full stability. While we expect services to remain stable, if you experience any Workflow-related errors, please contact our support team for assistance.
- resolved Aug 02, 2025, 03:46 AM UTC
The Workflow issues for US Cell 1, US Cell 2, US Cell 3, US Cell 4, US Cell 6, APJ Cell 1, US Cell 11, and US Cell 14 have been addressed. Our monitoring shows a return to normal conditions. Our team is collaborating with our third-party vendor to identify the underlying root cause, and a Root Cause Analysis will be forthcoming.
- resolved Aug 11, 2025, 08:13 PM UTC
We sincerely apologize for any impact this incident has caused you, your business, and your customers. At Okta, trust and transparency are our top priorities. The facts regarding this incident are outlined below. We are committed to implementing improvements to the service to prevent future occurrences of this incident.

Detection and Impact: On Friday, August 1st, at 4:13 AM PT, Okta was alerted to an issue in which message brokers were quickly exhausting disk space, causing delayed workflow processing in Cells OK1, OK2, OK3, OK4, OK6, OK7, OK11, and OK14. Starting at approximately 5:00 AM PT, users may have experienced processing delays, API timeouts, or Workflow executions ending prematurely with internal server errors. Delayed Workflow executions that were rerouted to healthy infrastructure were processed successfully.

Root Cause Summary: Based on our investigation and findings, the root cause of this issue was a disk space leak caused by a duplicate session, introduced during a maintenance window, within services managed by a third-party operator.

Remediation Steps: Immediately upon receiving alerts of service disruptions, Okta Engineering escalated the issue with our provider and worked to implement internal mitigations. While troubleshooting, Okta brought up multiple additional message brokers and swapped them as needed to balance the various waves of requests. Okta's internal mitigations restored service to the affected cells by approximately 12:34 PM PT, with additional stabilization work continuing through the afternoon. Okta continued to work directly with our provider to mitigate the issue and confirmed complete service restoration over the weekend.

Preventative Actions: Okta will continue working with our third-party service provider to enhance monitoring and communication, expedite detection, and improve processes for making infrastructure changes. Additionally, we are updating our operational procedures for scaling clusters and connections to further improve service recovery times.

Duration (# of minutes): 454 (approximately 7.5 hours, corresponding to the window from the onset of user impact at about 5:00 AM PT to service restoration at about 12:34 PM PT)
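The updates above repeatedly note that real-time flows could be delayed or time out during the incident. For callers invoking a Workflows endpoint, bounded retries with exponential backoff and jitter are a common way to ride out this kind of intermittent failure. The sketch below is illustrative only, not Okta guidance: the invoke URL is a placeholder, and the timeout and retry budget are assumptions.

```python
import time
import random
import requests

# Placeholder invoke URL -- substitute your flow's actual endpoint.
INVOKE_URL = "https://example.workflows.okta.com/api/flo/example_flow/invoke"

def invoke_flow_with_retries(payload, max_attempts=5, base_delay=1.0, timeout=10.0):
    """Invoke a flow, retrying on timeouts and 5xx responses with
    exponential backoff plus jitter. Raises after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(INVOKE_URL, json=payload, timeout=timeout)
            if resp.status_code < 500:
                return resp  # success, or a client error not worth retrying
        except requests.exceptions.Timeout:
            pass  # intermittent timeout: fall through to backoff and retry
        if attempt == max_attempts:
            raise RuntimeError(f"flow invocation failed after {max_attempts} attempts")
        # Exponential backoff with jitter avoids synchronized retry waves
        # that would add load to an already-degraded service.
        time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5))

if __name__ == "__main__":
    response = invoke_flow_with_retries({"event": "example"})
    print(response.status_code)
```

The jitter matters here: the RCA describes "waves of requests" during recovery, and uncoordinated retry timing is one way clients avoid contributing to such waves.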
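The remediation steps describe bringing up additional message brokers and swapping them as needed to balance load. As a generic sketch of that pattern, and not a description of Okta's internal tooling, a producer can route each publish to the least-loaded healthy broker and drop brokers from rotation when a health probe (here, free disk space, echoing the root cause) fails. All names and thresholds below are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Broker:
    """Hypothetical message broker handle with a health probe."""
    name: str
    free_disk_pct: float      # stand-in for a real disk/health metric
    in_flight: int = 0        # crude load signal

    def healthy(self, min_free_disk_pct=15.0):
        return self.free_disk_pct >= min_free_disk_pct

@dataclass
class BrokerPool:
    brokers: list = field(default_factory=list)

    def add(self, broker):
        """'Bring up' an extra broker by adding it to the rotation."""
        self.brokers.append(broker)

    def pick(self):
        """Route to the least-loaded healthy broker; raise if none remain."""
        healthy = [b for b in self.brokers if b.healthy()]
        if not healthy:
            raise RuntimeError("no healthy brokers available")
        return min(healthy, key=lambda b: b.in_flight)

if __name__ == "__main__":
    pool = BrokerPool()
    pool.add(Broker("broker-a", free_disk_pct=4.0))   # disk nearly exhausted
    pool.add(Broker("broker-b", free_disk_pct=60.0))
    pool.add(Broker("broker-c", free_disk_pct=55.0))  # freshly provisioned
    for _ in range(5):
        b = pool.pick()   # broker-a is skipped: it fails the disk check
        b.in_flight += 1
        print("publish ->", b.name)
```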
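The preventative actions call for enhanced monitoring and expedited detection. A minimal version of the kind of check that catches a disk-space leak before brokers exhaust their volumes might look like the following; the thresholds, mount point, and alert hook are all assumptions for illustration.

```python
import shutil

# Illustrative thresholds -- tune for your environment.
WARN_FREE_PCT = 25.0
CRITICAL_FREE_PCT = 10.0

def check_disk(path="/var/lib/broker"):
    """Return (level, free_pct) for the volume backing a broker's data dir."""
    usage = shutil.disk_usage(path)
    free_pct = 100.0 * usage.free / usage.total
    if free_pct < CRITICAL_FREE_PCT:
        return "critical", free_pct
    if free_pct < WARN_FREE_PCT:
        return "warning", free_pct
    return "ok", free_pct

def alert(level, free_pct):
    """Placeholder alert hook; a real deployment would page on-call
    (e.g., via a webhook) rather than print."""
    print(f"[{level}] broker data volume at {free_pct:.1f}% free")

if __name__ == "__main__":
    level, free_pct = check_disk("/")   # "/" so the sketch runs anywhere
    if level != "ok":
        alert(level, free_pct)
    else:
        print(f"disk ok: {free_pct:.1f}% free")
```

Run on a schedule against each broker's data volume, a check like this trips on a steady leak well before exhaustion, which is exactly the gap the "expedite detection" action targets.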