Twingate Controller outage

Twingate experienced a notice-level incident on June 3, 2022 affecting Authentication - Enterprise, Public API, and three more components, lasting 2h 32m. The incident has been resolved; the full update timeline is below.

Started
Jun 03, 2022, 04:16 AM UTC
Resolved
Jun 03, 2022, 06:49 AM UTC
Duration
2h 32m
Detected by Pingoru
Jun 03, 2022, 04:16 AM UTC

Affected components

Authentication - Enterprise, Public API, Admin Console, Authorization, Connector Heartbeat

Update timeline

  1. investigating Jun 03, 2022, 04:16 AM UTC

    Controller is currently experiencing an outage. Our team is investigating the issue.

  2. investigating Jun 03, 2022, 04:54 AM UTC

    During a planned Kubernetes version upgrade, our application started to fail. We failed over to our standby region/cluster, but it had the same issue. We are downgrading the Kubernetes version and continuing to work on the issue.

  3. monitoring Jun 03, 2022, 05:13 AM UTC

    After reverting the Kubernetes version and failing back to our previously active cluster, we see that the Twingate Service has recovered. We continue to monitor.

  4. resolved Jun 03, 2022, 06:49 AM UTC

    We are marking the incident as resolved. We will provide post-mortem notes as soon as we have them.

  5. postmortem Jun 10, 2022, 12:05 AM UTC

    **Summary**

    On June 3rd at 3 AM UTC, Twingate started a regular Kubernetes upgrade on its main cluster. This maintenance is usually done once a month and had been performed successfully many times prior to this upgrade. It includes a version upgrade of the cluster followed by a version upgrade of the node pools, completed one at a time. Around 4:01 AM UTC, HTTP 502 errors started appearing on our cloud load balancer instance, indicating an issue with the service. While these errors were a small portion of the overall request volume at first, by around 4:15 AM the system was fully overloaded and the issue turned into a full outage. Shortly after, we failed over to our standby cluster in a different region of our cloud provider, but saw the same issue on the standby cluster too. We downgraded our active cluster's node pools to the previous Kubernetes version, which added extra capacity, and then failed back to our active cluster at 5:05 AM UTC. Recovery started immediately and the service was fully recovered at 5:10 AM UTC.

    **Root cause**

    After a detailed investigation, we found that during the upgrade, network connectivity between internal components was not stable, triggering failures and retries. Because our application was overloaded, it failed to answer load balancer health checks, which caused the 502 errors. We are working with our cloud provider to analyze why the network instability happened during the upgrade.

    **Corrective actions**

    We have initiated a number of improvements:

    * Completed: We increased our main application capacity, tuned application and network settings between the affected services, upgraded our in-memory key-value store, and added a PDB (Pod Disruption Budget); a sketch of such a budget follows below.
    * Short-term: We will continue to tune the application and network settings between the various components of Twingate. We also found a bug in how our client handles 502 errors and are updating the client to handle them better.
    * Medium-term: We are looking into two major changes: 1) implementing circuit breaker functionality so our main application can stay up when a downstream service goes down, and 2) implementing a multi-region active-active setup on our cloud provider, which will enable us to better control Kubernetes upgrades (as well as other code and configuration changes).
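
To illustrate the PDB corrective action above: a Pod Disruption Budget is a standard Kubernetes object that caps how many pods of a workload can be taken down during voluntary disruptions such as the node drains of a node-pool upgrade. The manifest below is a minimal sketch; the name, namespace, label selector, and `minAvailable` threshold are hypothetical illustrations, not Twingate's actual configuration.

```yaml
# Hypothetical PodDisruptionBudget (all names and values are illustrative).
# While this object exists, the Kubernetes eviction API refuses to evict
# pods matching the selector if doing so would leave fewer than
# minAvailable replicas Ready -- so a rolling node-pool upgrade pauses
# rather than draining too much application capacity at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: controller-pdb       # hypothetical name
  namespace: production      # hypothetical namespace
spec:
  minAvailable: "80%"        # keep at least 80% of matching pods running
  selector:
    matchLabels:
      app: controller        # hypothetical label on the application's pods
```

Applied with `kubectl apply -f pdb.yaml`, a budget like this targets the failure mode described in the root cause: an upgrade can no longer remove enough replicas at once to leave the application unable to answer load balancer health checks.

The postmortem does not say which circuit breaker mechanism is planned, so as one common illustration only: in a service mesh such as Istio, circuit breaking toward a downstream service can be declared on a `DestinationRule`, combining connection-pool limits with outlier detection that temporarily ejects failing backends. The host name and all thresholds below are hypothetical.

```yaml
# Hypothetical Istio DestinationRule illustrating declarative circuit
# breaking toward a downstream service; not Twingate's configuration.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: downstream-circuit-breaker
spec:
  host: downstream-service   # hypothetical downstream service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100             # cap concurrent connections
      http:
        http1MaxPendingRequests: 50     # cap queued requests
    outlierDetection:
      consecutive5xxErrors: 5   # eject a backend after 5 straight 5xx errors
      interval: 30s             # how often backends are evaluated
      baseEjectionTime: 30s     # minimum ejection duration
      maxEjectionPercent: 50    # never eject more than half the backends
```

With limits like these, calls to an unhealthy downstream fail fast instead of piling up retries, which is the overload-and-retry pattern the root-cause analysis identifies.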