Twingate incident

Controller downtime

Twingate experienced a critical incident on July 29, 2021 affecting Connector Heartbeat, lasting 1h 45m. The incident has been resolved; the full update timeline is below.

Started: Jul 29, 2021, 03:15 PM UTC
Resolved: Jul 29, 2021, 05:00 PM UTC
Duration: 1h 45m
Detected by Pingoru: Jul 29, 2021, 03:15 PM UTC

Affected components

Connector Heartbeat

Update timeline

identified Jul 29, 2021, 05:44 PM UTC

We've identified the issue, which is being caused by excessive memory usage on our infrastructure.

monitoring Jul 29, 2021, 05:44 PM UTC

We've resolved the immediate issue by adding processing capacity and increasing memory limits on our controller infrastructure.

monitoring Jul 29, 2021, 05:47 PM UTC

The system is now confirmed as fully operational. We are working on an incident report and taking steps to ensure that this issue will not happen in the future.

resolved Jul 29, 2021, 05:47 PM UTC

This incident has been resolved.

postmortem Jul 30, 2021, 12:57 AM UTC

**Components impacted**

Controller

**Summary**

The controller was unable to service new authentication requests from approximately 15:17 to 15:19 UTC. During this period, new connection requests were rejected. Existing connections were not impacted. At the end of the outage period, normal service resumed with no remediation required.

**Root cause**

Leading up to the outage, automated monitoring alerted us to spikes in memory usage. At approximately 15:16 UTC we introduced a change to our cluster that was intended to increase memory availability. As this change rolled out at approximately 15:17 UTC, it had the unintended consequence of decreasing service availability, which caused most requests to be rejected.

**Corrective actions**

At 15:18 UTC, seeing the decrease in service availability, we reverted the change and simultaneously made additional hardware available to the cluster. Normal service resumed approximately 45 seconds later as the change propagated.

Looking ahead, we plan to:

1. Investigate introducing decoupling between inbound requests and our backend, because their tight coupling is the likely cause of the memory spikes that prompted the change behind the outage.
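
For readers unfamiliar with the idea, one common way to decouple inbound requests from backend processing is a bounded queue: intake only enqueues work, a fixed pool of workers drains it, and requests are shed once the buffer is full so that memory use stays capped. The sketch below is purely illustrative and is not Twingate's design; the class name `BoundedIntake`, its parameters, and the handler are all hypothetical.

```python
import queue
import threading


class BoundedIntake:
    """Hypothetical buffer between request intake and backend workers."""

    def __init__(self, capacity, handler, workers=2):
        # A fixed capacity caps how many pending requests are held in memory.
        self._queue = queue.Queue(maxsize=capacity)
        self._handler = handler
        for _ in range(workers):
            threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, request):
        """Enqueue a request; return False (shed load) if the buffer is full."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            return False

    def _drain(self):
        # Workers pull requests at their own pace, independent of intake rate.
        while True:
            request = self._queue.get()
            try:
                self._handler(request)
            finally:
                self._queue.task_done()

    def wait_idle(self):
        """Block until every accepted request has been handled."""
        self._queue.join()
```

With this shape, a burst of inbound requests fills the buffer and then gets an immediate rejection instead of growing backend memory without limit; the right capacity and worker count would depend on the real workload.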