Teleport experienced a major incident on February 20, 2025 affecting Cloud Service and Cloud Signups, lasting 5h 59m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Feb 20, 2025, 08:14 PM UTC
We are currently investigating connectivity degradation on our infrastructure.
- identified Feb 20, 2025, 08:34 PM UTC
We've identified the issue and have confirmed that connectivity disruption is only affecting a subset of tenants. Additionally, new signups are currently unavailable until the issue is resolved.
- monitoring Feb 21, 2025, 01:06 AM UTC
Services have been restored in all regions.
- resolved Feb 21, 2025, 02:14 AM UTC
This incident has been resolved.
- postmortem Mar 03, 2025, 09:58 PM UTC
Teleport Cloud customers experienced a service disruption intermittently impacting connectivity for about 40 minutes. The root cause of the incident was a bug in the Envoy Gateway ingress service that resulted in excessive memory utilization and ultimately restarts. The incident was triggered by the restart of all tenant services due to an undocumented background EKS storage migration that occurred during a scheduled maintenance period to upgrade the Kubernetes control plane. In response to the incident, the Teleport team shut down ingress controllers which prevented routing updates from triggering ingress service OOM events. This reduced customer impact by stabilizing ingress connectivity. Once the EKS storage migration and tenant service restarts were completed, the team rolled back the Envoy gateway version which stabilized the platform.