Teleport experienced a major incident on February 26, 2023 affecting Cloud Service, lasting 9h 54m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Feb 26, 2023, 07:24 AM UTC
We are currently investigating degraded regional connectivity for some customers.
- identified Feb 26, 2023, 08:17 AM UTC
We are working to resolve the connectivity issues across our regions.
- monitoring Feb 26, 2023, 09:42 AM UTC
We have observed restored connectivity in affected regions. Continuing to monitor the cluster.
- resolved Feb 26, 2023, 05:19 PM UTC
The incident has been resolved. No further issues have been observed in affected regions.
- postmortem Mar 06, 2023, 10:52 PM UTC
On Saturday, February 25, 2023, Teleport Cloud upgraded machine images in all regions. This maintenance required all cloud components and tenant clusters to be cycled. The rollout was done one region at a time, between 15:00 UTC and 20:00 UTC.

On Sunday, February 26, 2023, at 3:00 UTC, Teleport Cloud received an escalation from a customer reporting latency in the APAC region. This was resolved by cycling the customer's proxy servers. Later that morning, at 6:30 UTC, Teleport Cloud received another escalation from multiple customers about an inability to log in. Support staff observed elevated error rates on global ingress services in multiple regions. Those ingress services were cycled, one region at a time, between 7:50 UTC and 9:00 UTC. Error rates subsided and the incident was resolved.

On Monday, February 27, 2023, the Teleport Cloud Engineering team began investigating the root cause of the outage. Later that morning, the team had reproduced the same behavior in the Teleport Cloud staging environment. The cause was a DNS failure within the global ingress service. Cycling machines also cycled internal CNI DNS services, which changed the IP address associated with the DNS service. The global ingress service only updates its DNS service IP when its configuration is reloaded, so the service failed to resolve incoming connections until the service was cycled.

This outage exposed a number of areas the Teleport Cloud team is actively working to resolve.

1. This failure was not detected by the team's observability stack. This has since been resolved and deployed.
2. This issue was not caught when performing the same maintenance in the team's testing and staging environments. To mitigate, the team has added steps to the testing checklist that include verifying that tenant clusters are operational and that users can log in, by testing with a canary tenant.
3. The operational procedure for rolling out the next set of machine images has been updated to include verification that the global ingress service's configuration is reloaded after the DNS service IP changes.

The Teleport Cloud team has already scheduled a project this quarter focused on enhancing the global ingress service, which will increase reliability for all customers and avoid the need to reload configurations. Additional scope has been added to that project to evaluate short-term solutions to completely mitigate this issue the next time machine images require updating.
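
The failure mode described in the postmortem, an ingress that keeps using the DNS service IP it read at its last configuration load, can be detected by probing that address directly and comparing it with the current one. The following is an added example, not Teleport's code or tooling; the function names, the canary hostname, and the assumption that the caller supplies the loaded and current resolver addresses are all illustrative.

```python
import os
import socket
import struct

def build_query(name: str, txid: int) -> bytes:
    """Encode a minimal DNS A query (RFC 1035) with the RD flag set."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def resolver_answers(addr, name, timeout=1.0):
    """True only if addr (ip, port) returns a DNS response to our query."""
    txid = int.from_bytes(os.urandom(2), "big")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.connect(addr)
            s.send(build_query(name, txid))
            data = s.recv(512)
        except OSError:  # timeout, ICMP port unreachable, no route
            return False
    if len(data) < 12:
        return False
    rid, flags = struct.unpack("!HH", data[:4])
    # Alive if: id matches, QR bit set, RCODE is NOERROR (0) or NXDOMAIN (3)
    return rid == txid and bool(flags & 0x8000) and (flags & 0x000F) in (0, 3)

def classify(loaded, current, name="canary.example.internal", timeout=1.0):
    """Compare the resolver the ingress loaded at its last reload with the current one.

    loaded/current are (ip, port) tuples. How they are obtained is
    deployment-specific and assumed to be supplied by the caller.
    """
    if resolver_answers(loaded, name, timeout):
        return "ok" if loaded == current else "drifted"
    if resolver_answers(current, name, timeout):
        return "stale"     # DNS service moved; ingress needs a config reload
    return "dns_down"      # the DNS service itself is not answering

if __name__ == "__main__":
    # Constructed addresses (TEST-NET-1, RFC 5737); replace with real values.
    print(classify(("192.0.2.10", 53), ("192.0.2.53", 53), timeout=0.5))
```

A `stale` result is the condition from this incident: the DNS service is healthy at its new address, but the address held by the ingress no longer answers.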