Twingate incident

Twingate Service Down

Twingate experienced a critical incident on May 28, 2025 affecting Authentication - Enterprise and Identity Providers Sync and 1 more component, lasting 39m. The incident has been resolved; the full update timeline is below.

Started: May 28, 2025, 08:53 PM UTC
Resolved: May 28, 2025, 09:32 PM UTC
Duration: 39m
Detected by Pingoru: May 28, 2025, 08:53 PM UTC

Affected components

Authentication - EnterpriseIdentity Providers SyncPublic APIAuthentication - SocialAdmin ConsoleMFAAuthorizationRealtime UpdatesConnector HeartbeatClient Log Upload

Update timeline

identified May 28, 2025, 08:53 PM UTC

The issue has been identified and a fix is being implemented.
monitoring May 28, 2025, 09:16 PM UTC

A fix has been implemented and we are monitoring the results.
resolved May 28, 2025, 09:32 PM UTC

This incident is resolved now. We'll provide a post-mortem as soon as we have details.
postmortem Jun 16, 2025, 06:26 PM UTC

### **Incident Duration** **UTC Time**: May 28, 2025, from **20:45 to 21:14** \(29 minutes\) ### **Components Impacted** **Control Plane Services** * Authentication \(Enterprise & Social\) * Multi-Factor Authentication \(MFA\) * Authorization * Connector Heartbeat **Management Plane Services** * Identity Provider Sync * Public API * Admin Console * Real-Time Updates * Client Log Upload * 3rd Party Integrations * Network Dashboards * DNS Filtering Dashboards ### **Summary** On May 28, 2025, Twingate experienced a service disruption that impacted multiple core services for approximately 29 minutes. The issue stemmed from a misconfiguration introduced during a rollout that updated routing rules in our global load balancer. A new routing rule was unintentionally prioritized, resulting in most API traffic being directed to a single backend. This overwhelmed the target service, leading to elevated error rates. Lower environments did not expose this issue due to their reduced traffic levels. The incident was promptly detected and the rollback was initiated quickly. However, recovery was delayed due to infrastructure dependencies that also relied on affected components. A manual intervention using local tooling ultimately restored service. ### **Root Cause** A routing misconfiguration caused a traffic imbalance that overwhelmed a backend service, leading to widespread API failures. ### **Resolution Timeline** * **20:45 UTC** – Incident begins. API traffic misrouted. * **20:48 UTC** – Incident detected and triage begins. * **20:52 UTC** – Rollback initiated. * **21:05 UTC** – Automated rollback process encounters delays. * **21:09 UTC** – Engineers execute manual fix. * **21:14 UTC** – Services fully recovered. ### **Corrective and Preventative Actions** ### **Short-Term \(Completed\)** * ✅ Routing rule corrected. * ✅ Alerts added to detect backend traffic anomalies in lower environments. * ✅ Improved resource scaling thresholds. ### **Mid-Term \(In Progress\)** * ⚙️ Decouple infrastructure tooling from service dependencies to ensure faster recovery options. ### **Long-Term \(Planned\)** * 🔄 Introduce backend sharding to isolate customer workloads and reduce incident impact. We sincerely apologize for the disruption and are taking actions to improve our systems' resilience. We appreciate your continued support and trust.