WorkOS experienced a major incident on October 10, 2025 affecting SSO, Audit Logs, and one other component, lasting 40 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Oct 10, 2025, 04:21 PM UTC
We are investigating an issue with our API. We apologize for the inconvenience and will share an update once we have more information.
- identified Oct 10, 2025, 04:35 PM UTC
We’ve identified the issue and implemented a fix.
- monitoring Oct 10, 2025, 04:52 PM UTC
Services are returning to normal, and our team is continuing to monitor the situation.
- resolved Oct 10, 2025, 05:02 PM UTC
This incident has been resolved.
- postmortem Oct 15, 2025, 12:19 AM UTC
## Summary

On October 10, 2025, error rates increased across AuthKit and SSO API endpoints. At peak, 28% of AuthKit authentication API requests failed. For customers using AuthKit Sessions, failure rates peaked at 55%. SSO endpoints experienced a 0.18% failure rate. During the incident, end users may have experienced errors when attempting to complete authentication flows, and authenticated sessions may have ended prematurely.

## What Happened

WorkOS has historically relied on a third-party vendor to manage data encryption of application secrets, such as client key pairs. We are now in the process of migrating from this third-party vendor to our own product, WorkOS Vault. On October 10, we began a migration of client key pairs to Vault, causing a dramatic increase in traffic to Vault's public API. This increase triggered one of our public API rate limits, resulting in throttled requests to Vault. API requests that depend on data encryption, primarily authentication-related requests, subsequently returned intermittent errors.

## Timeline

| Time (UTC) | Event |
| --- | --- |
| 16:05 | Primary impact window begins. Elevated errors observed for Vault-dependent flows. |
| 16:13 | Incident opened. |
| 16:28 | Mitigation applied. Services begin recovering. |
| 16:30 | APIs return to normal operation. |
| 17:02 | Incident marked as resolved. |

## Remediation

Once we identified the cause, we modified rate limit rules to account for the expected increase in internal API traffic. Shortly after the incident, we added improved alerting at the network edge to decrease time to detection. In addition, we are prioritizing the following work:

* Implementing more fine-grained controls for data migrations to allow for incremental rollouts
* Re-routing internal API requests through internal paths

## Conclusion

We recognize and apologize for the significant impact this incident had on you and your customers. We're committed to implementing lasting improvements to ensure greater stability going forward.
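The failure mode described in "What Happened" (a bulk migration's burst of requests exhausting a public API rate limit, so subsequent requests are throttled) can be sketched with a minimal token-bucket limiter. This is an illustrative model only, not WorkOS's actual rate limiter; the rates, capacity, and class names are hypothetical.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (hypothetical parameters,
    not WorkOS's actual implementation)."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec        # steady-state refill rate
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens for the time elapsed since the last request,
        # capped at the bucket's capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request is throttled (would surface as HTTP 429)

# A bulk migration fires requests far faster than the bucket refills:
# the initial burst drains the capacity, and nearly everything after
# that is throttled -- the dynamic behind the intermittent errors.
bucket = TokenBucket(rate_per_sec=10, capacity=20)
burst_results = [bucket.allow() for _ in range(100)]
throttled = burst_results.count(False)
```

An incremental rollout, as named in the remediation list, amounts to pacing the migration's request rate below `rate_per_sec` (or exempting internal traffic from the public limit entirely) so the bucket never empties.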