Okta incident

Extended slowness in US-Cell 2

Severity: Minor. Status: Resolved.

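Okta reported a minor incident on May 1, 2024 affecting Core Platform in US Cell 2 (OK2). Service impact lasted about 21 minutes, from 1:01 am to 1:22 am PDT; the 7d 16h duration shown below runs from the first status update to the final root cause update on May 9. The incident has been resolved; the full update timeline is below.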

Started
May 01, 2024, 08:24 AM UTC
Resolved
May 09, 2024, 12:32 AM UTC
Duration
7d 16h
Detected by Pingoru
May 01, 2024, 08:24 AM UTC

Affected components

Core Platform

Update timeline

  1. resolved May 01, 2024, 08:24 AM UTC

    An issue causing slowness and request/response failures (HTTP 50X errors) in OK2, starting at 1:01 am PDT, was addressed in approximately 20 minutes. Additional root cause information will be available within 5 business days. Affected cells: okta.com:2

  2. resolved May 01, 2024, 08:30 AM UTC

    An issue causing slowness and request/response failures (HTTP 50X errors) in OK2, starting at 1:01 am PDT, was addressed in approximately 20 minutes. Additional root cause information will be available within 5 business days.

  3. resolved May 09, 2024, 12:31 AM UTC

    We sincerely apologize for any impact this incident has caused to you, your business, or your customers. At Okta, trust and transparency are our top priorities. Outlined below are the facts regarding this incident. We are committed to implementing improvements to the service to prevent future occurrences of this incident.

    Detection and Impact

    On May 1st, at 1:01 am PDT, Okta’s monitoring system alerted our team to an issue where some users experienced increased error rates and slow response times, and may have received HTTP 500 response code errors, in US Cell OK2. The service was in read-only mode, continuing to support authentication flows and read-only operations; however, write operations would fail during this period.

    Root Cause Summary

    Based on our investigation and findings, the root cause of the issue was a sudden increase in system resource usage, which caused a primary database to stop responding.

    Remediation Steps

    Okta immediately implemented mitigations by reducing overall resource utilization load on the impacted service and executing a database failover per our standard procedure. As of May 1st at 1:22 am PDT, the service returned to normal operation.

    Preventative Actions

    Okta is taking action to improve capacity and alerting capabilities. Engineering teams are actively focused on isolating and curing the underlying resource contention issue, and have added new guidance to the operational processes to further improve time to service recovery.

    Total Duration

    Total Duration (Minutes): 21
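The page reports two different durations, 7d 16h and 21 minutes, because they measure different windows. The sketch below recomputes both from the timestamps quoted above. It assumes PDT is a fixed UTC-7 offset; the variable names and the `format_duration` helper are illustrative and not part of any Okta or status-page tooling.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
PDT = timezone(timedelta(hours=-7), "PDT")  # assumption: fixed UTC-7 offset

# Status-page window: "Started" to "Resolved" as listed on this page.
page_started = datetime(2024, 5, 1, 8, 24, tzinfo=UTC)
page_resolved = datetime(2024, 5, 9, 0, 32, tzinfo=UTC)

# Service-impact window as stated in the root cause summary.
impact_started = datetime(2024, 5, 1, 1, 1, tzinfo=PDT)
impact_ended = datetime(2024, 5, 1, 1, 22, tzinfo=PDT)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as 'Nd Nh Nm', truncating seconds."""
    total_minutes = int(delta.total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m"


page_window = page_resolved - page_started
impact_window = impact_ended - impact_started

print(format_duration(page_window))    # 7d 16h 8m
print(format_duration(impact_window))  # 0d 0h 21m
print(impact_started.astimezone(UTC).isoformat())  # 2024-05-01T08:01:00+00:00
```

Converted to UTC, the impact window is 08:01 to 08:22 on May 1, so the first status update at 08:24 UTC was posted after service had already recovered.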