SecurID Service Incident (NA Region)

SecurID experienced a major incident on September 5, 2023, lasting 2h 11m. It affected eight components: the Administration Console and Authentication Service on na.access, na2.access, na3.access, and na4.access. The incident has been resolved; the full update timeline is below.

Started: Sep 05, 2023, 01:41 PM UTC
Resolved: Sep 05, 2023, 03:53 PM UTC
Duration: 2h 11m
Detected by Pingoru: Sep 05, 2023, 01:41 PM UTC

Affected components

na.access Administration Console
na.access Authentication Service
na2.access Administration Console
na2.access Authentication Service
na3.access Administration Console
na3.access Authentication Service
na4.access Administration Console
na4.access Authentication Service

Update timeline

investigating Sep 05, 2023, 01:41 PM UTC

We have detected an issue affecting SecurID. SaaS Operations is investigating the issue and will post updates as they become available.

monitoring Sep 05, 2023, 02:56 PM UTC

The issue affecting SecurID has been corrected. The SaaS Operations team is monitoring the fix. We will post a root cause analysis as soon as it is available.

monitoring Sep 05, 2023, 02:56 PM UTC

We are continuing to monitor for any further issues.

resolved Sep 05, 2023, 03:53 PM UTC

After monitoring the fix, SaaS Operations has determined that the incident affecting SecurID has been resolved. We will post a root cause analysis as soon as it is available.

postmortem Sep 08, 2023, 08:16 PM UTC

**PRELIMINARY RCA**

On September 5th, 2023, between 13:47 and 14:09 UTC, customers in our NA region encountered an Authentication and Administration Service disruption. This was followed by a period until 14:41 UTC where customers may have experienced degraded service, depending on their DNS caching configurations.

This incident was triggered by failures in some nodes within our Web Application Firewall (WAF) cluster, leading to a performance degradation and resource exhaustion. Traffic on impacted nodes slowed down and eventually failed. As part of our restoration process, we reverted to a known good configuration temporarily, causing some customers to briefly encounter an expired SSL certificate. Subsequently, the cluster was fully restored to a healthy state.

To minimize downtime, we initiated a failover to our secondary site at 14:09 UTC, restoring Authentication and Administration services there. Meanwhile, our Operations team continued to mitigate the incident at the primary site. By 14:41 UTC the mitigation was complete and traffic was restored to our primary site.

In response to this incident, RSA is actively enhancing the ID Plus service and related processes. Our steps include:

* Ongoing evaluation of best-of-class technology for third-party components. We are already in the process of replacing our WAF solution (currently targeted for completion January 2024 or earlier).
* Implementing additional WAF performance and stability improvements in September (planned prior to the incident).
* Collaborating with vendors to conduct a comprehensive Root Cause Analysis (RCA) of the WAF failure and implementing additional mitigations.
* Encouraging customers to ensure both primary and secondary regions are reachable from on-premises infrastructure, with an enhanced validation feature already available in the August IDR.
* Enhancing failover capabilities in the next Identity Router release, enabling more rapid switchover regardless of DNS caching configurations.
* Continuing to review ID Plus service logs and customer logs for potential additional mitigations to be included in the final RCA.
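
As an illustration of the recommendation above to confirm that both primary and secondary regions are reachable from on-premises infrastructure, the following is a minimal Python sketch of a TCP reachability check. It is not the validation feature shipped with the Identity Router, and it is not part of the RCA. The hostnames are placeholders rather than real SecurID endpoints, and the function names (`tcp_reachable`, `check_regions`, `unreachable`) are hypothetical; substitute the primary and secondary addresses that apply to your own tenant.

```python
import socket
from typing import Callable, Dict, List, Tuple

Endpoint = Tuple[str, int]  # (hostname, port)


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False


def check_regions(
    endpoints: Dict[str, Endpoint],
    probe: Callable[[str, int], bool] = tcp_reachable,
) -> Dict[str, bool]:
    """Probe every named endpoint and map its name to a reachability flag."""
    return {name: probe(host, port) for name, (host, port) in endpoints.items()}


def unreachable(results: Dict[str, bool]) -> List[str]:
    """Return the sorted names of endpoints whose probe failed."""
    return sorted(name for name, ok in results.items() if not ok)


if __name__ == "__main__":
    # Placeholder hostnames (assumption): replace with your tenant's
    # primary and secondary region addresses.
    regions = {
        "primary": ("primary.example.invalid", 443),
        "secondary": ("secondary.example.invalid", 443),
    }
    failed = unreachable(check_regions(regions))
    if failed:
        print("Unreachable regions:", ", ".join(failed))
    else:
        print("All regions reachable")
```

A successful TCP connection only shows that the network path is open. It does not validate the TLS certificate or the service behind it, so a check like this complements, rather than replaces, vendor-provided validation.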