Teleport incident

Connectivity Disruption

Minor, Resolved

Teleport experienced a minor incident on May 15, 2023 affecting Cloud Service, lasting 9h 48m. The incident has been resolved; the full update timeline is below.

Started: May 15, 2023, 04:26 PM UTC
Resolved: May 16, 2023, 02:15 AM UTC
Duration: 9h 48m
Detected by Pingoru: May 15, 2023, 04:26 PM UTC

Affected components

Cloud Service

Update timeline

  1. investigating May 15, 2023, 04:26 PM UTC

    We are currently experiencing elevated connectivity errors on some regional proxies. We are working towards a resolution and determining the full impact.

  2. identified May 15, 2023, 06:54 PM UTC

    We have identified remediation and are working on rolling out the fix for impacted tenants.

  3. identified May 15, 2023, 08:14 PM UTC

    We are continuing to work on a fix for this issue.

  4. monitoring May 15, 2023, 09:44 PM UTC

    We've taken action to stabilize tenants that were impacted and are continuing to monitor.

  5. resolved May 16, 2023, 02:15 AM UTC

    This incident has been resolved.

  6. postmortem May 25, 2023, 06:03 PM UTC

    On May 13th at 01:24 UTC, during verification of a Teleport Cloud platform release, regional proxy services for a subset of tenants lost connectivity with the Teleport Auth service. Connectivity was restored by restarting the Auth services for those tenants. On May 15th at 15:00 UTC, a review of the incident found additional tenants experiencing similar symptoms. The issue was traced to an internal cloud component that caches Teleport Auth service IP addresses to facilitate multi-region connectivity. Restarting the Auth services for the impacted tenants refreshed the IP cache, which allowed the regional proxy services to reconnect. Further diagnosis produced a set of performance improvements for the IP address cache component. These improvements are in progress and scheduled for release in the coming weeks. Because the cloud platform runs Teleport in high availability mode, connectivity for the majority of tenants and users remained stable: at least one Proxy service was healthy in an impacted region.
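
The postmortem does not describe how the IP address cache is implemented. As an illustration only, the sketch below shows one common way a per-tenant address cache can avoid holding stale entries until a restart: entries expire after a time-to-live, and a caller can invalidate an entry when a connection attempt fails. The class `AuthAddressCache`, the `resolve` callback, and the 30-second TTL are all hypothetical and are not taken from Teleport.

```python
import time
from typing import Callable, Dict, List, Tuple


class AuthAddressCache:
    """Hypothetical TTL cache of Auth service IP addresses, keyed by tenant.

    This is an illustrative sketch, not Teleport's actual component.
    """

    def __init__(
        self,
        resolve: Callable[[str], List[str]],
        ttl_seconds: float = 30.0,  # assumed value, chosen for illustration
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve  # looks up current addresses for a tenant
        self._ttl = ttl_seconds
        self._clock = clock
        # tenant -> (time the entry was stored, addresses)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, tenant: str) -> List[str]:
        """Return cached addresses, re-resolving if the entry is missing or expired."""
        entry = self._entries.get(tenant)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            return entry[1]
        return self.refresh(tenant)

    def refresh(self, tenant: str) -> List[str]:
        """Resolve the tenant's addresses now and store them with a fresh timestamp."""
        addrs = self._resolve(tenant)
        self._entries[tenant] = (self._clock(), addrs)
        return addrs

    def invalidate(self, tenant: str) -> None:
        """Drop a tenant's entry, for example after a failed connection attempt."""
        self._entries.pop(tenant, None)
```

In this sketch, a proxy that fails to reach a cached address would call `invalidate(tenant)` and then `get(tenant)` to pick up the current addresses, rather than waiting for the Auth service to be restarted.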