Teleport incident

Some clusters failing to restart following Tenant Upgrade

Major Resolved

Teleport experienced a major incident on May 4, 2023 affecting Cloud Service, lasting 2h. The incident has been resolved; the full update timeline is below.

Started
May 04, 2023, 02:20 PM UTC
Resolved
May 04, 2023, 04:21 PM UTC
Duration
2h
Detected by Pingoru
May 04, 2023, 02:20 PM UTC

Affected components

Cloud Service

Update timeline

  1. identified May 04, 2023, 02:20 PM UTC

    We've identified an issue where some customer Teleport clusters are failing to restart following Tenant Upgrade. We're working on restoring services for affected customers.

  2. identified May 04, 2023, 03:00 PM UTC

    Due to an issue with access requests, we will be rolling back all customers who were upgraded yesterday. Customers currently running 12.3.1 will be rolled back to 12.2.4. As a result, all active sessions to Teleport Cloud will be terminated. Teleport agents will automatically reconnect to their respective clusters. Connectivity may be disrupted for 2-10 minutes, depending on the number of connected resources in the cluster.

  3. resolved May 04, 2023, 04:21 PM UTC

    Rollbacks of customer Teleport clusters have been completed.

  4. postmortem May 05, 2023, 03:17 PM UTC

    Following a routine upgrade of customer Teleport clusters to v12.3.1, the Auth servers of a few tenants failed to start up in a healthy state. It was determined that a change to the Okta access request code was causing a panic. The affected tenants were rolled back. To prevent any other tenants from being impacted (upon creation of an access request), the remaining tenants were also downgraded to the previous version (v12.2.4). Three tenants were unable to start their Auth service, and their service was disrupted for up to 2.5 hours. The remaining tenants had their service disrupted during remediation, resulting in up to 10 minutes of unplanned unavailability.