SecureAuth incident

Cloud Service Issue

Minor | Resolved

SecureAuth experienced a minor incident on September 12, 2024 affecting the SMS, Voice, and Push components, lasting 1h 55m. The incident has been resolved; the full update timeline is below.

Started
Sep 12, 2024, 02:43 AM UTC
Resolved
Sep 12, 2024, 04:38 AM UTC
Duration
1h 55m
Detected by Pingoru
Sep 12, 2024, 02:43 AM UTC

Affected components

SMS, Voice, Push

Update timeline

  1. investigating Sep 12, 2024, 02:43 AM UTC

    We are investigating an issue with our Cloud Services and will post updates as we gain a better understanding of the issue.

  2. identified Sep 12, 2024, 03:33 AM UTC

    We have identified the issue and are in the process of implementing a fix.

  3. monitoring Sep 12, 2024, 04:17 AM UTC

    A fix has been implemented and all issues have been resolved. We are continuing to monitor.

  4. resolved Sep 12, 2024, 04:38 AM UTC

    The incident has been resolved. For any remaining issues or questions please reach out to [email protected].

  5. postmortem Sep 12, 2024, 08:29 PM UTC

    **Polaris Twilight Outage RCA - September 12, 2024**

    **Problem Description**

    On September 11, 2024 at 7:16PM PST, the SecureAuth Cloud Infrastructure encountered widespread connection issues with database systems, which resulted in authentication failures for impacted customers.

    **Cause**

    The SecureAuth Cloud Operations team was alerted to connection issues with the Twilight service (an integral service on which other microservices rely). Upon investigation, we identified that the service was experiencing database latency due to CPU utilization spikes on the database. The CPU spikes triggered mass restarts of the Twilight service, which in turn caused extended CPU spikes on the database.

    The root cause was legacy dependencies on the database that were negatively affected during a redistribution exercise related to the Vault migration performed on August 29, 2024. Those legacy dependencies were originally determined to be benign and were therefore assumed to have no impact on the customer base after the Vault migration. It was determined that the CPU spikes were caused by the interface between the service and the database, in the form of health checks that created a snowball effect, resulting in the aforementioned issues with the Twilight service.

    Due to the nature of this issue, not all customers were immediately impacted; however, the recovery and resolution of this issue impacted all customer cloud services as a result of the scaling operations.

    **Recovery**

    To mitigate this issue, the cloud services were scaled down to alleviate database pressure. Once the database stabilized, the services were scaled back up in a controlled manner until all services were fully restored.
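The recovery described above (scale down, then bring services back in controlled stages once the database is stable) can be illustrated with a small sketch. This is not SecureAuth's tooling: the function name `staggered_scale_up`, the callbacks `read_db_cpu` and `set_replicas`, the step size, and the 70% CPU ceiling are all hypothetical, chosen only to show the idea of gating each scale-up round on database health.

```python
from typing import Callable, Dict, List


def staggered_scale_up(
    current: Dict[str, int],
    targets: Dict[str, int],
    read_db_cpu: Callable[[], float],          # hypothetical: returns DB CPU as 0.0-1.0
    set_replicas: Callable[[str, int], None],  # hypothetical: applies a replica count
    step: int = 2,
    cpu_ceiling: float = 0.70,                 # assumed threshold, not from the RCA
    max_rounds: int = 100,
) -> List[Dict[str, int]]:
    """Raise each service toward its target replica count in small steps.

    A round only adds replicas while database CPU is below cpu_ceiling;
    otherwise the round is skipped so the database can settle. Returns the
    replica counts recorded after every round.
    """
    current = dict(current)  # do not mutate the caller's mapping
    history: List[Dict[str, int]] = []
    for _ in range(max_rounds):
        if all(current[name] >= target for name, target in targets.items()):
            break
        if read_db_cpu() < cpu_ceiling:
            for name, target in targets.items():
                if current[name] < target:
                    current[name] = min(target, current[name] + step)
                    set_replicas(name, current[name])
        # A real operator loop would pause here between rounds.
        history.append(dict(current))
    return history
```

The point of the gate is that a round which finds the database under pressure adds nothing, so load is reintroduced only as fast as the database can absorb it.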
    **Timeline**

    Sep 11, 2024

    * 7:16PM PST – Twilight connection issues begin and alerts are triggered
    * 7:17PM PST – Cloud Operations team joins bridge to investigate alerts
    * 7:27PM PST – Issue is understood and mitigation efforts begin
    * 7:27PM PST – Scale down of cloud services to alleviate database pressure begins
    * 7:40PM PST – Scale down complete and database CPU utilization stabilizes
    * 7:41PM PST – Controlled (staggered) scale up of cloud services begins
    * 8:30PM PST – Controlled scale up of cloud services is completed
    * 8:40PM PST – All services in running state
    * 9:00PM PST – Validation testing complete and incident resolved
    * Post-9:00PM PST – Continued to monitor closely while working with some customers as needed to resolve intermittent issues caused by the incident

    **Corrective Actions**

    * Engineering to review and improve the Twilight to Cockroach Database interface and determine a more elegant solution for the health check actions, one that would reduce mass restarts of the service during high-usage spikes.
    * Leadership review of database alternatives to the solution architecture.
    * Improve decision-making accuracy by increasing team knowledge of legacy systems, to ensure end-to-end awareness of the potential impact of configuration changes assumed to be benign.
    * Introduce additional gates into the existing CAB (Change Advisory Board) process, including additional Engineering leadership and cross-functional Subject Matter Experts.
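The RCA attributes the snowball effect to health checks between the service and the database, and lists a gentler health check as a corrective action. One common way to soften that coupling is sketched below. It is an assumption-laden illustration, not the Twilight implementation: the class `CachedDbHealthCheck`, the `ping_db` callback, and the TTL and jitter values are all hypothetical. The idea is that a liveness check never touches the database, while a readiness check pings it at most once per jittered interval and reports "not ready" on failure instead of prompting a restart.

```python
import random
import time
from typing import Callable, Optional


class CachedDbHealthCheck:
    """Hypothetical health check that limits load on the database.

    - live(): reports only that the process is running; no database call.
    - ready(): pings the database at most once per TTL (plus jitter) and
      returns the cached result in between.
    """

    def __init__(
        self,
        ping_db: Callable[[], bool],       # hypothetical cheap DB probe
        ttl_seconds: float = 10.0,         # assumed cache lifetime
        jitter_seconds: float = 2.0,       # spreads probes across replicas
        clock: Callable[[], float] = time.monotonic,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self._ping_db = ping_db
        self._ttl = ttl_seconds
        self._jitter = jitter_seconds
        self._clock = clock
        self._rng = rng
        self._next_probe_at: Optional[float] = None
        self._last_ok = False

    def live(self) -> bool:
        # Liveness is independent of the database, so database latency
        # alone cannot cause an orchestrator to restart the service.
        return True

    def ready(self) -> bool:
        now = self._clock()
        if self._next_probe_at is None or now >= self._next_probe_at:
            try:
                self._last_ok = bool(self._ping_db())
            except Exception:
                # A slow or failed probe means "not ready", not "restart".
                self._last_ok = False
            self._next_probe_at = now + self._ttl + self._rng() * self._jitter
        return self._last_ok
```

With this split, a database CPU spike takes instances out of rotation temporarily rather than restarting them all at once, and the jitter keeps replicas from probing the database in lockstep.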