WorkOS experienced a major incident on August 21, 2025 affecting Dashboard, SSO, and one other component, lasting 2h 22m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Aug 21, 2025, 05:39 PM UTC
AuthKit is experiencing elevated platform errors. We're currently investigating the issue, and will provide an update soon.
- investigating Aug 21, 2025, 05:59 PM UTC
Our platform is experiencing elevated errors and latency platform-wide. We're still investigating the issue, and will provide an update soon.
- investigating Aug 21, 2025, 07:11 PM UTC
We're continuing to investigate on our side, but our edge networking provider has confirmed problems in the US East region. We're working with them for updates as they resolve it.
- monitoring Aug 21, 2025, 07:29 PM UTC
Our systems are returning to normal. We’ll continue monitoring closely with our upstream provider.
- resolved Aug 21, 2025, 08:02 PM UTC
The incident has been resolved.
- postmortem Aug 26, 2025, 11:43 PM UTC
**Date:** August 21, 2025
**Duration:** 17:26 - 19:27 UTC
**Status:** Resolved

## **Summary**

On August 21, 2025, our platform experienced elevated errors and high latency across several services, including sign-in and API requests. The disruption was caused by network link saturation between one of our upstream providers and our cloud hosting environment, which created a widespread disruption of internet traffic. During the incident, some customers encountered difficulties logging in with username/password or SSO, along with delays and intermittent errors in the API, Dashboard, Admin Portal, and Directory Sync. While the extent varied over time, a meaningful portion of requests was degraded or failed entirely.

## **What Happened**

The disruption stemmed from connectivity issues between our upstream network provider and our cloud infrastructure provider. This caused elevated latency, timeouts, and service unavailability until traffic conditions were stabilized.

## **Timeline**

17:26 UTC - Elevated errors observed and internal incident created
17:47 UTC - Root cause identified as a network issue with our upstream provider
17:52 UTC - Team began mitigating secondary impacts from the increased number of open requests, while monitoring network conditions
19:27 UTC - Problem with the upstream network provider resolved; response times returned to normal

## **Remediation**

Our engineering team investigated multiple paths to remediation while closely monitoring the upstream provider's recovery efforts. Service performance returned to normal once network stability was restored, after which we transitioned to monitoring to ensure ongoing stability.

To improve resilience and reduce the likelihood of similar issues in the future, we are:

* Improving observability at the network boundary between our infrastructure and upstream providers.
* Expanding our monitoring and alerting to more quickly detect issues with our upstream network.
* Evaluating architectural changes, including multi-region deployment and reduced reliance on single external providers, to further enhance reliability.

## **Conclusion**

We regret the significant impact this incident had on you and your customers. We are committed to implementing lasting improvements to ensure greater stability going forward, and we will keep you updated on our progress.
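For readers wanting to detect this class of upstream degradation on their own side, the monitoring idea in the remediation list can be sketched as a small probe: time a handful of requests to an endpoint and flag a sustained rise in error rate or latency. This is a minimal illustration, not WorkOS's actual tooling; the URL, sample count, and thresholds are placeholder assumptions you would tune for your own environment.

```python
import time
import urllib.request


def probe(url, samples=5, timeout=2.0):
    """Return (error_rate, avg_latency_seconds) over a few requests.

    A hypothetical boundary probe: each sample is one timed HTTP GET;
    failures and timeouts count toward the error rate.
    """
    errors, latencies = 0, []
    for _ in range(samples):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                pass
        except Exception:
            errors += 1
        latencies.append(time.monotonic() - start)
    return errors / samples, sum(latencies) / len(latencies)


def degraded(error_rate, avg_latency, max_error_rate=0.2, max_latency=1.0):
    """Alert decision: either metric crossing its threshold counts as degraded."""
    return error_rate > max_error_rate or avg_latency > max_latency
```

Running `probe("https://api.example.com/health")` on a schedule and alerting when `degraded(...)` stays true across consecutive runs would have surfaced the elevated errors and latency described above without waiting for user reports.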