Twingate incident

Controller downtime

Twingate experienced a notice-level incident on November 5, 2021, affecting Connector Heartbeat and lasting 11h 52m. The incident has been resolved; the full update timeline is below.

Started: Nov 05, 2021, 06:37 PM UTC
Resolved: Nov 06, 2021, 06:30 AM UTC
Duration: 11h 52m
Detected by Pingoru: Nov 05, 2021, 06:37 PM UTC

Affected components

Connector Heartbeat

Update timeline

investigating Nov 05, 2021, 06:37 PM UTC

We are looking into it and will provide more information as soon as we have it.
investigating Nov 05, 2021, 06:49 PM UTC

We are continuing to investigate this issue.
investigating Nov 05, 2021, 06:53 PM UTC

We are back and operational now. We are still investigating the root cause.
investigating Nov 05, 2021, 06:54 PM UTC

We are continuing to investigate this issue.
monitoring Nov 05, 2021, 09:05 PM UTC

We are still investigating the root cause of the incident. We didn't find any issue on our side, and we are working with our cloud provider support team to investigate the matter further.
resolved Nov 06, 2021, 06:30 AM UTC

This incident has been resolved.
postmortem Nov 09, 2021, 04:15 AM UTC

**Components impacted**

Relay, Controller

**Summary**

A physical hardware failure occurred in a node within one of our Eastern US Relay clusters at approximately 18:19 UTC. Connectors and Clients attached to this node automatically failed over to a new node. This failover process resulted in a partial outage of the Controller, which was only partially available to service requests from approximately 18:21 to 18:40 UTC. At the end of this period, normal service resumed with no remediation required.

**Root cause**

A physical hardware failure occurred in a single node within one of our Eastern US Relay clusters. Although our service provider swapped out the hardware automatically, the failure caused all Connectors attached to this Relay node to fail over automatically to a new Relay node, producing a flood of connection requests. The failover itself proceeded normally; however, the volume of connection requests was sufficient in this instance to temporarily prevent the Controller from accepting new connection requests. This in turn triggered additional reconnection requests, exacerbating the original problem.

**Corrective actions**

As soon as we received monitoring alerts, the DevOps and on-call engineering teams started triaging the issue. Additional nodes were started to handle the spike in connection requests, and the system was monitored as the request rate recovered. The system was brought back to a normal running state at 18:40 UTC.

Looking ahead, we have already made, or plan to make, the following changes:

1. Add additional nodes and increase memory limits across the board to serve as an additional buffer for failover-based connection spikes.
2. Make changes to our heartbeat monitoring logic to increase overall resilience during transient traffic peaks.
3. Introduce changes to the Connector logic to maintain connections to multiple Relay nodes at all times, resulting in a flatter spike in failover reconnection requests.
4. Introduce additional resiliency in token issuance to prevent temporary spikes in connection requests from influencing otherwise healthy Clients and Connectors.
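
The feedback loop described under the root cause, and the effect of corrective actions 1 and 3, can be illustrated with a toy model. The sketch below is not Twingate's implementation: the function name `simulate_reconnect_surge`, the per-tick capacity model, and every number used are hypothetical, chosen only to show why a burst that exceeds capacity generates extra retry traffic, and why more capacity or a reconnect spread over time reduces it.

```python
from typing import NamedTuple


class SurgeResult(NamedTuple):
    peak_offered: int       # largest number of requests offered in one tick
    rejected_attempts: int  # total attempts turned away (each one is retried)
    ticks_to_drain: int     # ticks until every displaced connector is accepted


def simulate_reconnect_surge(displaced: int,
                             capacity_per_tick: int,
                             spread_ticks: int = 1) -> SurgeResult:
    """Toy model of a failover reconnect surge (all parameters hypothetical).

    displaced         -- connectors that must reconnect after a node failure
    capacity_per_tick -- connection requests the controller accepts per tick
    spread_ticks      -- ticks over which first attempts arrive; 1 means
                         everyone reconnects at once, larger values model
                         connectors that can afford to reconnect gradually
    """
    if displaced < 0 or capacity_per_tick <= 0 or spread_ticks <= 0:
        raise ValueError("displaced must be >= 0; other arguments must be > 0")

    base, extra = divmod(displaced, spread_ticks)
    remaining_arrivals = displaced
    pending = 0   # rejected requests that retry on the next tick
    peak = 0
    rejected = 0
    tick = 0

    while remaining_arrivals > 0 or pending > 0:
        arrivals = 0
        if tick < spread_ticks:
            # Spread first attempts as evenly as possible across the window.
            arrivals = base + (1 if tick < extra else 0)
        remaining_arrivals -= arrivals

        offered = pending + arrivals
        peak = max(peak, offered)

        # Anything above capacity is rejected and comes back as a retry.
        pending = max(0, offered - capacity_per_tick)
        rejected += pending
        tick += 1

    return SurgeResult(peak, rejected, tick)


if __name__ == "__main__":
    # Everyone reconnects at once against limited capacity.
    print(simulate_reconnect_surge(1000, 200))
    # Same burst with doubled capacity (compare corrective action 1).
    print(simulate_reconnect_surge(1000, 400))
    # Same load spread over 10 ticks (compare corrective action 3).
    print(simulate_reconnect_surge(1000, 200, spread_ticks=10))
```

In this model the simultaneous burst produces 2000 rejected attempts on top of the original 1000 requests, doubling capacity cuts that to 800, and spreading the same reconnects over ten ticks produces none. A real system would also involve timeouts, backoff, and token issuance, none of which are modeled here.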