Twingate experienced a major incident on May 13, 2022 affecting Authentication - Enterprise, Public API, and one additional component, lasting 1h 39m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating May 13, 2022, 03:38 PM UTC
We are currently investigating the incident.
- investigating May 13, 2022, 03:48 PM UTC
We have seen improvements and are monitoring the situation. We will post updates as we learn more about the issue.
- investigating May 13, 2022, 03:55 PM UTC
We are continuing to investigate this issue.
- investigating May 13, 2022, 03:57 PM UTC
The system is fully up. We will continue to monitor.
- resolved May 13, 2022, 05:17 PM UTC
We are marking this issue as resolved. The impact was between 8:31 am and 8:44 am (15:31 to 15:44 UTC). We will add the post-mortem to this incident as soon as it is ready.
- postmortem May 18, 2022, 12:01 AM UTC
**Summary**

At approximately 15:31 UTC on May 13, 2022, we received alerts from our monitoring systems pointing to a problem with Twingate. Our cloud provider's load balancer started to return 502 (Bad Gateway) errors due to issues with our backend system. Looking into our backend logs, we noticed that only 10-15% of requests were being handled properly, and we decided to restart our application pod in our Kubernetes cluster (an illustrative restart sketch follows this post-mortem). Once the backend application pod restarted, the load balancer stopped returning 502 errors and things returned to normal around 15:44 UTC.

During the outage, both our private and public APIs were affected. These APIs drive most of the functionality that end users and administrators experience in Twingate. Specifically, customers' admin consoles were not accessible, the public API was not responsive to requests, and Clients and Connectors were unable to initiate authentication. Existing connections continued to function as a result of the reliability work we completed in Q1 2022, provided that the Clients and Connectors were running the latest versions. For this reason, we recommend that all of our customers upgrade their Clients and Connectors as soon as they can.

**Root cause**

After a detailed investigation, we found indications of network glitches that caused connectivity issues and higher latency across different, unrelated parts of our system. While some components self-healed (e.g. our Redis instance), our main backend application was impacted. Much higher latency on calls to a third-party service we use led to connection saturation of our API layer and the rejection of additional requests, which manifested as 502 errors to the requestor.

**Corrective actions**

To mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements:

* Completed: We increased the CPU and memory reservations for our backend application and relay pods, and decreased the connection timeout threshold for the third-party service so that it cannot cause connection saturation again (see the timeout sketch below).
* Short-term: We are adding more metrics and enabling more logging to help with investigation and post-mortem analysis in the future.
* Medium-term: While we already have some circuit-breaker capabilities and flags to turn off certain features, we will evaluate a complete service mesh solution with circuit-breaker capabilities that keeps upstream applications and APIs running when issues and latencies arise in downstream dependencies (see the circuit-breaker sketch below).
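For readers unfamiliar with the recovery step mentioned in the summary, the following is a minimal sketch of a rolling restart of a Kubernetes deployment using the Python client. The deployment name `backend` and namespace `prod` are hypothetical placeholders, not Twingate's actual resources.

```python
from datetime import datetime, timezone
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumes cluster access).
config.load_kube_config()
apps = client.AppsV1Api()

# Equivalent of `kubectl rollout restart deployment/backend -n prod`:
# patching the pod template annotation causes Kubernetes to recreate
# the pods gracefully, which is how a misbehaving application pod can
# be cycled without deleting the deployment.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment(name="backend", namespace="prod", body=patch)
```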
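The completed corrective action reduces how long a request to the third-party dependency can hold a connection open. The sketch below illustrates the general technique with an explicit outbound timeout; the URL and timeout values are illustrative assumptions, not Twingate's actual configuration.

```python
from typing import Optional
import requests

# Hypothetical third-party endpoint used only for illustration.
THIRD_PARTY_URL = "https://third-party.example.com/api/check"

def call_third_party(payload: dict) -> Optional[dict]:
    try:
        # (connect timeout, read timeout) in seconds; without an explicit
        # timeout, requests will wait indefinitely for a slow response,
        # tying up connections in the API layer.
        resp = requests.post(THIRD_PARTY_URL, json=payload, timeout=(1.0, 2.0))
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.Timeout:
        # Fail fast and release the connection rather than queueing behind
        # a slow dependency; the caller can degrade gracefully.
        return None
```

Failing fast like this keeps a slow downstream service from exhausting the upstream connection pool, which is the saturation mode described in the root cause.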
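The circuit-breaker capability mentioned in the medium-term item can be sketched roughly as follows. This is a minimal, application-level illustration of the pattern, not Twingate's code or a specific service-mesh product; thresholds and timings are arbitrary.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a
    cool-down period so upstream APIs keep serving instead of saturating."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: skip the downstream call entirely.
                raise RuntimeError("circuit open: downstream call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

A service mesh applies the same pattern at the infrastructure layer, outside application code, which is what the medium-term evaluation refers to.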