Siteimprove incident
Siteimprove Platform Login Errors (SSO Users)
Siteimprove experienced a major incident on May 6, 2026 affecting Platform, lasting 1h 14m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating May 06, 2026, 02:53 PM UTC
We are currently experiencing issues where some customers are unable to log in to the Siteimprove platform using SSO via siteimprove.com and my2.siteimprove.com. Impact: - Users who are already logged in should not experience disruption unless they log out or their session times out. - New login attempts may fail. Our Development Team is actively working to identify the root cause and resolve the issue as quickly as possible. We apologize for this inconvenience. We'll continue to keep you updated.
- identified May 06, 2026, 03:07 PM UTC
Good news! We’ve found the issue and we’re working on a fix. Thanks for your patience! We'll continue to keep you updated.
- monitoring May 06, 2026, 03:30 PM UTC
A fix has been implemented and we are monitoring the results. SSO Users should now be able to successfully log into the Siteimprove platform via siteimprove.com and my2.siteimprove.com disruption free.
- resolved May 06, 2026, 04:08 PM UTC
The issue affecting SSO logins to the Siteimprove platform is all sorted. We’re sorry for holding you up! If you have any additional questions or feedback, please submit a new support ticket through our Help Center ("?" button) located at the top right-hand-side of the Siteimprove platform.
- postmortem May 14, 2026, 05:54 PM UTC
**Executive Summary:** On May 6, 2026, an infrastructure failure in our Kubernetes environment caused SSO login failures for a subset of users for approximately 42 minutes. All other platform functionality was unaffected. The issue was caused by a loss of quorum in an internal cluster's control plane, compounded by a node that had been silently degraded. The incident was resolved by replacing the affected nodes, and the system has been operating normally since. **Incident Overview:** The issue originated from an internal infrastructure cluster that acts as a connectivity bridge between our environments. Two of the cluster's three control plane nodes became non-functional — one had been in a degraded state, and a second failed during normal operations. With two of three nodes down, the cluster could no longer coordinate its workloads, which disrupted the network path required for SSO logins for some users. Some of our internal tooling was also affected. To resolve the issue, the team replaced the unresponsive nodes, which restored normal operations. Login functionality recovered shortly after. **Impact:** Some users logging into the platform via SSO experienced failures from approximately **2026-05-06T14:34Z** to **2026-05-06T15:16Z** \(~42 minutes\). **Detection:** The incident was detected at **2026-05-06T14:34Z** when our automated monitoring system identified connectivity failures in the login journey, which alerted the operations team. **Response:** Our operations team began investigating immediately. The affected control plane nodes were identified and replaced, with full recovery confirmed via automated tests at **2026-05-06T15:16Z**. A status page update was posted during the incident. **Root Cause:** Two of three control plane nodes in an internal cluster became unavailable at the same time — one had been silently degraded, and a second failed independently. The monitoring pipeline that should have detected the degraded node ahead of time was itself not functioning correctly. Because it only logged on failure and not on success, its silence was indistinguishable from normal operation, and the team had no way to know it had stopped working. Follow-up actions updated the monitoring pipeline to emit success signals so that their absence can be alerted on, improving overall control plane monitoring coverage, and introducing scheduled control plane node replacements as a preventive measure.