Twingate incident

Service provider networking outage

Twingate experienced a critical incident on November 16, 2021, affecting Americas Relays and Connector Heartbeat and lasting 4h 47m. The incident has been resolved; the full update timeline is below.

Started: Nov 16, 2021, 05:53 PM UTC
Resolved: Nov 16, 2021, 10:41 PM UTC
Duration: 4h 47m
Detected by Pingoru: Nov 16, 2021, 05:53 PM UTC

Affected components

Americas Relays, Connector Heartbeat

Update timeline

investigating Nov 16, 2021, 05:53 PM UTC

We are investigating an outage report with regards to Twingate. At this time we suspect it is an issue affecting broader Internet services and is not isolated to Twingate. We will continue to post regular updates as we learn more.
investigating Nov 16, 2021, 06:06 PM UTC

We are continuing to investigate this issue.
investigating Nov 16, 2021, 06:16 PM UTC

The Twingate admin console is now accessible and the Twingate Controller is operational. Customers may need to restart Connectors to restore connectivity to resources due to the nature of the networking outage. The originating cause appears to be related to an outage in Google Cloud Platform's Networking service. Google Cloud Platform has opened an incident: https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh
monitoring Nov 16, 2021, 06:29 PM UTC

We are continuing to monitor the status of the service. Customers may need to restart Connectors to restore connectivity to resources due to the nature of the networking outage.
monitoring Nov 16, 2021, 06:29 PM UTC

We are continuing to monitor for any further issues.
monitoring Nov 16, 2021, 07:08 PM UTC

We have verified that all of our infrastructure is fully operational at this time and will continue to monitor for any changes. Until our service provider (Google Cloud Platform) has closed their incident, we will leave this incident open in Monitoring status and provide regular updates as we receive them. Customers should verify that all of their Connectors are up and running if any Resources are inaccessible at this time.
monitoring Nov 16, 2021, 08:16 PM UTC

Google Cloud Platform has marked their Cloud Networking issue as resolved and has posted a status update: https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh We are continuing to monitor our infrastructure and will mark this incident as resolved when we are confident that everything has returned to normal.
resolved Nov 16, 2021, 10:41 PM UTC

We are marking this issue as resolved as our monitoring shows that our infrastructure is operating normally and Google Cloud Platform has resolved the incident on their network. We will be following up with a post-mortem shortly.
postmortem Nov 19, 2021, 11:53 PM UTC

**Components impacted**

Controller, Relays

**Summary**

Twingate services were unavailable to service requests from approximately 17:48 to 20:09 UTC on November 16th. The result was that during this period of time, access to Twingate and protected resources was limited, existing connections were dropped, and new connections were refused. At the end of the period, normal service resumed. Remediation required that customers reconnect their Connectors in order to restore access to protected resources.

**Root cause**

Google Cloud Platform (GCP) deployed a configuration change in their infrastructure that caused all requests to return 404 errors ([GCP incident description](https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh)). Because Twingate relies on GCP infrastructure, access to the Twingate network and protected resources was impacted. GCP confirms that the incident was resolved as of 19:28 UTC.

As GCP began to restore their service, impacted Twingate services automatically came back online. Currently, Twingate Clients and Connectors view 404 errors as unrecoverable states and thus did not automatically reconnect. Consequently, customers were required to restart their Connectors and the Windows service on the Windows Client to restore access.

**Corrective actions**

Automated monitoring alerted Twingate to the outage and our DevOps and on-call engineering teams started tracking the issue. Manual testing confirmed the outage, and additional investigation showed that other GCP customers were impacted. While traffic was being restored, systems indicated that Connectors did not automatically recover. For customers using our Managed Connectors, these were restarted at 20:50 UTC. We began notifying customers about the need to restart Connectors at approximately 19:00 UTC, and all customers were notified by 02:02 UTC on November 17th.

Looking ahead, we plan to:

* Prioritize Client and Connector reconnection behavior and extend it to include all non-recoverable errors
* Introduce functionality to notify customers of Connector downtime via email notifications
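
The reconnection behavior described in the postmortem (a 404 treated as unrecoverable, so Clients and Connectors stayed down until restarted) can be illustrated with a small sketch. This is not Twingate's implementation; the function names `backoff_delays` and `reconnect`, the default values, and the choice of capped exponential backoff are all assumptions made for illustration. It shows one way a client might keep retrying on any non-200 status, including 404, instead of giving up.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float, cap: float, attempts: int) -> Iterator[float]:
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


def reconnect(
    connect: Callable[[], int],
    attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call connect() until it returns 200 or the attempts run out.

    Hypothetical policy: every non-200 status, including 404, is treated
    as retryable rather than as an unrecoverable state.
    """
    delays = list(backoff_delays(base, cap, attempts))
    for i, delay in enumerate(delays):
        if connect() == 200:
            return True
        # Wait before the next attempt, but not after the final failure.
        if i < len(delays) - 1:
            sleep(delay)
    return False
```

Here `connect` stands in for whatever call establishes the session and returns an HTTP status code. A production client would likely add jitter to the delays and distinguish credentials errors from transient upstream failures; those details are left out of this sketch.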