Twingate experienced a major incident on August 10, 2021 affecting Connector Heartbeat, lasting 1h 38m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- monitoring Aug 10, 2021, 08:10 PM UTC
The Controller infrastructure was experiencing degraded availability. The issue began at 19:39 UTC and continued until 19:47 UTC. Our team is currently investigating the root cause of the issue, and we will post additional updates here.
- monitoring Aug 10, 2021, 08:11 PM UTC
The Controller is currently fully available, and we are actively investigating the root cause of the issue.
- resolved Aug 10, 2021, 09:24 PM UTC
This incident has been resolved. We will post a postmortem shortly.
- postmortem Aug 12, 2021, 11:37 PM UTC
**Components impacted**

Controller

**Summary**

The Controller was partially unavailable to service requests from approximately 19:39 to 19:46 UTC. During this period, access to protected resources was limited, some existing connections were dropped, and new connections were refused. At the end of the period, normal service resumed with no remediation required.

**Root cause**

Leading up to the start of the incident was a planned maintenance period. The maintenance propagated a configuration change across our Relay clusters. Due to human error, the change was not applied sequentially, one cluster at a time, but was instead released to all of our US clusters in parallel. Once the configuration change was applied, it triggered reconnection requests from all active Clients and Connectors to our Relay infrastructure. As part of the reconnect process, Clients and Connectors needed to obtain new tokens from the Controller. At 19:39 UTC, the resulting spike of requests triggered our health check system, which incorrectly determined that the Controller was misbehaving and required restarting. The frequent Controller restarts caused the decrease in service availability.

**Corrective actions**

As soon as the health check system kicked in, the DevOps and on-call engineering teams began tracking down the issue. Logs and system metrics confirmed that everything except the health check system was performing well, so a decision was made to disable it. Seconds after it was disabled, the system returned to a fully operational state. At 21:22 UTC a hotfix was deployed to the health check system and it was re-enabled. Looking ahead, we plan to:

1. Only perform planned Relay maintenance operations that require connection migration outside of peak traffic hours.
2. Enforce a stricter limit on the number of parallel Relay cluster deployments.
3. Fix the issues identified with our health check system and improve our performance and stress testing to include more aggressive connection migration scenarios.
4. Update the Twingate status page immediately upon confirmation of an issue impacting customers.
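The sequential, one-cluster-at-a-time rollout discipline described in the root cause can be sketched as a simple deployment loop. This is a hypothetical illustration, not Twingate's actual tooling; the function names (`apply_config`, `is_healthy`) and cluster identifiers are assumptions:

```python
from typing import Callable, Iterable, List

def rolling_deploy(
    clusters: Iterable[str],
    apply_config: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a configuration change one cluster at a time.

    Each cluster must pass a health check before the rollout
    proceeds to the next one; a failed check halts the rollout,
    so a bad change only ever reaches a single cluster instead
    of every cluster in parallel.
    """
    deployed: List[str] = []
    for cluster in clusters:
        apply_config(cluster)
        if not is_healthy(cluster):
            raise RuntimeError(f"halting rollout: {cluster} is unhealthy")
        deployed.append(cluster)
    return deployed
```

Enforcing this in the deployment pipeline (rather than relying on operator discipline) is one way to implement the stricter parallel-deployment limit mentioned in the corrective actions.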
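The health-check failure mode, where a transient load spike was misread as a misbehaving service and triggered repeated restarts, is commonly mitigated by requiring several consecutive failed probes before acting. A minimal sketch of that idea, assuming a probe-based checker (this is illustrative, not Twingate's implementation):

```python
class DebouncedHealthCheck:
    """Only declare a service unhealthy after several consecutive
    failed probes, so a short-lived request spike does not trigger
    a restart storm that itself degrades availability.
    """

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True if a restart is warranted."""
        if probe_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

A single failed probe during a token-request spike would then be absorbed, and only a sustained failure would cause a restart.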