StrongDM incident

We are seeing sporadic issues in which users need to log in multiple times to be authenticated.

StrongDM experienced a major incident on September 6, 2024, affecting the Admin UI and lasting 39 minutes. The incident has been resolved; the full update timeline is below.

Started: Sep 06, 2024, 03:19 PM UTC
Resolved: Sep 06, 2024, 03:59 PM UTC
Duration: 39m
Detected by Pingoru: Sep 06, 2024, 03:19 PM UTC

Affected components

Admin UI

Update timeline

investigating Sep 06, 2024, 03:19 PM UTC

We are currently investigating this issue and will update here with more information.
investigating Sep 06, 2024, 03:27 PM UTC

We are continuing to investigate this issue.
monitoring Sep 06, 2024, 03:31 PM UTC

The issue has been identified and a fix has been implemented. Normal operations should resume. We will continue to monitor and provide further updates here.
monitoring Sep 06, 2024, 03:38 PM UTC

We are continuing to monitor for any further issues.
monitoring Sep 06, 2024, 03:39 PM UTC

The US Control Plane was experiencing intermittent issues with authentication and with listing available resources, affecting all users. Users were required to authenticate multiple times before being allowed into the Admin UI or the SDM Client. We have remediated the source of the issue and are continuing to monitor for any additional errors.
resolved Sep 06, 2024, 03:59 PM UTC

The incident is considered resolved, as we have seen no additional errors. We will perform an internal post-mortem/RCA and an incident after-action review next week.
postmortem Sep 12, 2024, 08:38 PM UTC

A recent update revealed an underlying issue with our replica database. Internal alerts flagged the problem and the incident was declared when StrongDM received customer support tickets. To resolve the issue and prevent recurrence, our Infrastructure team made improvements to query monitoring and adjusted how we manage replica lag within RDS.

**Incident Timeline:**

* **Sep 5, 20:49 UTC** - Change deployed
* **Sep 6, 13:50 UTC** - First problem report from internal alerts
* **Sep 6, 15:15 UTC** - Tickets from two customers, incident declared, replica disabled
* **Sep 6, 15:59 UTC** - Incident resolved