Twingate experienced a minor incident on March 21, 2025, affecting Authentication - Enterprise, Identity Providers Sync, and one other component, lasting 1h 15m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Mar 21, 2025, 08:30 PM UTC
We are currently investigating this issue.
- monitoring Mar 21, 2025, 08:44 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Mar 21, 2025, 09:13 PM UTC
This incident has been resolved.
- postmortem Mar 28, 2025, 03:14 PM UTC
**Twingate Public RCA: March 21, 2025 Authentication/Authorization Incident**

**Summary**

On March 21, 2025, between 21:10 and 21:41 UTC, a subset of authentication and authorization requests to Twingate services experienced elevated error rates. The issue was mitigated by 21:41 UTC and services returned to normal operation.

**What Happened**

Twingate runs an active-active architecture across multiple Google Kubernetes Engine (GKE) clusters, balanced by a Global Load Balancer (GLB). At the onset of the incident, we observed anomalies in one of our clusters and, to protect overall system health, proactively scaled down deployments in that cluster. This shifted traffic to the other clusters in the topology, primarily one that typically handles a lighter load and had been scaled accordingly.

The target cluster began autoscaling as expected, but the increased traffic caused elevated error rates, which triggered our retry mechanisms across services. While these retries helped many requests succeed, they also increased overall system load. Simultaneously, the cluster underwent a cloud provider-initiated update operation that caused pod restarts and reduced capacity.

To stabilize the system, we reintroduced capacity in the previously affected cluster, rebalancing traffic across regions. Once this occurred, error rates subsided and retries diminished.

**Root Cause**

Anomalous behavior from our cloud provider, including unexpected request timeouts at the load balancer level and instability during a cluster update, led to a cascade of retry traffic that temporarily overwhelmed parts of the system. We are actively investigating both the unexpected timeout configuration and the behavior of the cluster during the update with our cloud provider.

**Corrective Actions**

Short-term:

* Continue collaborating with our cloud provider to understand the root cause of the unexpected timeouts and the cluster update impact.
* COMPLETED: Keep deployments in all regions at the same number of replicas in their HPA configuration.
* COMPLETED: Increase max node counts on cluster node autoscalers to give us more room to scale up.
* COMPLETED: Configure HPA to scale up faster.
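The completed HPA items above (equal replica floors across regions, faster scale-up) could look roughly like the following `autoscaling/v2` manifest. The names, replica counts, and thresholds here are hypothetical assumptions for illustration, not Twingate's actual configuration:

```yaml
# Illustrative HorizontalPodAutoscaler sketch; all names and numbers are
# assumptions, not Twingate's production values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: auth-service-hpa        # hypothetical service name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: auth-service
  minReplicas: 10               # same floor in every region, so a lightly
  maxReplicas: 60               # loaded cluster can absorb failover traffic
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react immediately to load spikes
      policies:
        - type: Percent
          value: 100                  # allow doubling replicas per period
          periodSeconds: 15
```

Raising the cluster autoscaler's max node count (the other completed item) complements this: a fast HPA is only useful if new pods have nodes to land on.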
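The postmortem notes that retries helped requests succeed but also amplified load on an already degraded cluster. A common client-side mitigation for this failure mode is capped exponential backoff with full jitter, which spreads retries out in time so they do not arrive as a synchronized wave. The sketch below is illustrative only, not Twingate's actual retry logic:

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=0.1, cap=5.0):
    """Retry fn() with capped exponential backoff and full jitter.

    Illustrative example: jittered delays desynchronize retries across
    clients so a transient error spike does not become a retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the retry budget is exhausted.
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped
            # exponential delay for this attempt.
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
```

A fixed retry budget (`max_attempts`) is as important as the jitter: it bounds how much extra traffic retries can add when a dependency is genuinely down.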