Teleport incident

Connectivity Issues

Minor Resolved

Teleport experienced a minor incident on March 30, 2023 affecting Cloud Service, lasting 18h 18m. The incident has been resolved; the full update timeline is below.

Started: Mar 30, 2023, 12:17 AM UTC
Resolved: Mar 30, 2023, 06:36 PM UTC
Duration: 18h 18m
Detected by Pingoru: Mar 30, 2023, 12:17 AM UTC

Affected components

Cloud Service

Update timeline

  1. investigating Mar 30, 2023, 12:17 AM UTC

    The Teleport Cloud operations team is investigating reports of connectivity issues.

  2. investigating Mar 30, 2023, 05:21 PM UTC

    We're continuing to investigate intermittent connectivity issues.

  3. monitoring Mar 30, 2023, 05:51 PM UTC

    A fix has been implemented and we are monitoring the system.

  4. resolved Mar 30, 2023, 06:36 PM UTC

    This incident has been resolved. A postmortem is in progress.

  5. postmortem Apr 06, 2023, 09:43 PM UTC

    On Wednesday, March 29, 2023 at 20:45 UTC, multiple tenants began experiencing connectivity issues with Teleport Cloud. Monitoring indicated that connections were terminating and reconnecting, primarily in the AWS us-west-2 region. These issues continued until Thursday, March 30, 2023 at 17:00 UTC.

    The root cause of this incident was a routing misconfiguration by AWS. This "large scale regional event" caused traffic from us-west-2 to move from Oregon Edge (HOI) to Seattle Edge locations (SEA4 and SEA19), which triggered AnyCast shifts between the two Seattle Edge locations and resulted in connection timeouts for traffic originating from us-west-2. AWS responded by rerouting traffic originating from us-west-2 to Oregon Edge instead of the Seattle Edge locations, which reduced the number of connection terminations caused by AnyCast shifts.

    Teleport Cloud Operations responded to this event on Wednesday, March 29, 2023 at 21:30 UTC, as customers began noticing connectivity issues. The team observed connections flapping for multiple tenants and began diagnosing the cause. The first observation was that this incident behaved similarly to the incident of February 6th, which was the result of AWS Global Accelerator maintenance, so the team opened a support case with AWS at 22:02 UTC. The team continued to observe the platform and discussed mitigation options. On Thursday, March 30, 2023 at 2:10 UTC, the team cycled the load balancer pods in us-west-2, but that did not resolve the issue. Later, at 5:20 UTC, the team cycled internal CNI agents to clear their routing cache, but that also had no impact. At 6:47 UTC, the team manually updated AWS Global Accelerator configurations in us-west-2 to route new connections to another region and cycled a single tenant's proxy pods in us-west-2 to force reconnects. By 7:24 UTC, this configuration change had proved unsuccessful and the configurations were reverted.

    On Thursday, March 30, 2023 at 8:00 UTC, the incident was handed off from the US support team to the EU support team, with a focus on gathering more details from AWS. At 17:00 UTC the same day, AWS fixed the routing issue, which resulted in stable connectivity for Teleport Cloud customers.

    The Teleport Cloud team continues to work with AWS on options for detecting AnyCast shifts, or any Global Accelerator maintenance, so that such issues can be identified and responded to with more urgency. AWS has indicated that they plan to update their routing policy to prevent AnyCast shifts at the Seattle Edge locations in the future, and to add metrics and alarms that detect AnyCast shifts. The Teleport Cloud team has also been testing the ability to disable TCP termination for all Global Accelerator instances, which is expected to further stabilize connectivity for Teleport Cloud tenants.
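
As an illustration of the kind of detection discussed in the postmortem, the connection flapping that the team observed could in principle be flagged by counting reconnects per tenant inside a sliding time window. The sketch below is a minimal example under stated assumptions: it is not Teleport's or AWS's actual tooling, and the class name `FlapDetector`, the tenant label, the window length, and the threshold are all hypothetical values chosen for demonstration.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapDetector:
    """Hypothetical helper: flags a tenant when too many reconnects
    land inside a sliding time window."""

    def __init__(self, window: timedelta, threshold: int) -> None:
        self.window = window        # how far back reconnects are counted
        self.threshold = threshold  # reconnects in the window that count as flapping
        self._events: dict[str, deque[datetime]] = {}

    def record_reconnect(self, tenant: str, at: datetime) -> bool:
        """Record one reconnect and return True if the tenant is now flapping."""
        events = self._events.setdefault(tenant, deque())
        events.append(at)
        # Drop reconnects that have aged out of the window.
        while events and at - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold


# Example usage with made-up values: three reconnects within five minutes.
detector = FlapDetector(window=timedelta(minutes=5), threshold=3)
start = datetime(2023, 3, 29, 20, 45)
for offset in range(3):
    flapping = detector.record_reconnect("example-tenant", start + timedelta(minutes=offset))
```

A per-tenant window keeps one noisy tenant from masking or triggering alerts for others; a real deployment would need thresholds tuned against normal reconnect rates.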