Twingate incident
Inbound requests are experiencing heavily degraded availability
Twingate experienced a critical incident on December 13, 2021 affecting the Connector Heartbeat component. The incident lasted one day and has been resolved; the full update timeline is below.
Affected components
- Connector Heartbeat
Update timeline
- investigating Dec 13, 2021, 05:16 PM UTC
We are currently investigating this issue.
- investigating Dec 13, 2021, 05:41 PM UTC
We are continuing to investigate this issue.
- investigating Dec 13, 2021, 06:07 PM UTC
We are continuing to investigate this issue. We will be posting regular updates pertaining to this incident.
- investigating Dec 13, 2021, 06:38 PM UTC
We are continuing to investigate this issue. We will be posting regular updates pertaining to this incident.
- investigating Dec 13, 2021, 07:08 PM UTC
We are continuing to investigate this issue. We have narrowed the source of the problem to the public-facing frontend servers that handle inbound requests to our service. As a result, this is broadly affecting our public API, the private API calls used by Clients and Connectors, and our web interface, resulting in heavily degraded availability across our service. We are still working to identify the root cause and will continue to post regular updates.
- investigating Dec 13, 2021, 07:12 PM UTC
Inbound requests are now being accepted and the service is operational again. We have verified that all operational tests are succeeding. We are continuing to investigate to determine the root cause of this incident.
- monitoring Dec 13, 2021, 07:21 PM UTC
We are continuing to monitor the system, and we are still investigating the root cause of the outage. We will continue to post additional updates regularly.
- monitoring Dec 13, 2021, 08:39 PM UTC
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will continue to post additional updates regularly.
- monitoring Dec 13, 2021, 10:12 PM UTC
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will post our next update at 17:00 PST / 01:00 UTC.
- monitoring Dec 14, 2021, 01:07 AM UTC
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will post our next update at 21:00 PST / 05:00 UTC.
- monitoring Dec 14, 2021, 05:21 AM UTC
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will post our next update at 09:00 PST / 17:00 UTC.
- resolved Dec 14, 2021, 05:42 PM UTC
We are continuing to monitor the system, and it remains stable and available. We are closing out this incident, and we will continue to post updates and follow up with a post-mortem here.
- postmortem Dec 21, 2021, 04:35 AM UTC
**Components impacted**

* Controller

**Summary**

At approximately 17:06 UTC on December 13th, we observed an increase in latency between our API layer and our backend database system. Within a few minutes, this latency spike developed into an outage in which 90-95% of requests returned one of two responses: a 500 (Internal Server Error) or a 502 (Bad Gateway), depending on where in our system the error occurred. These errors were caused by timeouts between our API layer and the database and persisted until approximately 19:08 UTC.

During the outage, both our private and public APIs were affected. These APIs drive most of the functionality that end users and administrators experience in Twingate. Specifically, customers' admin consoles were not accessible, the public API was not responsive to requests, Clients and Connectors were unable to initiate authentication, and existing connections were eventually dropped without the ability to re-authenticate.

**Root cause**

The root cause was a temporary loss of connectivity and increased network latency within our cloud service provider between our API layer and the backend database. Although the API layer remained available to respond to requests, each request took significantly longer, saturating the API layer's connections; additional requests were then rejected and surfaced to the requestor as 500 or 502 errors.

**Corrective actions**

To mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements to isolate end user connectivity from backend service disruptions. These projects include decoupling our backend database from the Controller, scaling cross-regional database replicas for additional resiliency, and changing user connection behavior to maintain connectivity when backend services are unreachable.
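The sketch below is a minimal, hypothetical illustration of the saturation mechanism described in the root cause above; it is not Twingate's actual code, and the names (`POOL_SIZE`, `ACQUIRE_TIMEOUT_S`, `handle_request`) are invented for the example. It shows how, with a fixed-size database connection pool, a spike in query latency leaves every connection occupied, so new requests fail to acquire a connection and are rejected upstream even though the API layer itself is healthy.

```python
import queue
import threading
import time

# Hypothetical sketch: a fixed-size connection pool saturating under
# increased database latency. Requests that cannot acquire a connection
# within the timeout are rejected, surfacing as a 502 to the caller.

POOL_SIZE = 4
ACQUIRE_TIMEOUT_S = 0.5  # how long a request waits for a free connection

pool = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    pool.put(f"conn-{i}")

def handle_request(db_latency_s: float) -> int:
    """Return an HTTP-style status code for one simulated request."""
    try:
        conn = pool.get(timeout=ACQUIRE_TIMEOUT_S)
    except queue.Empty:
        return 502  # pool saturated: request rejected before reaching the DB
    try:
        time.sleep(db_latency_s)  # stand-in for the database round trip
        return 200
    finally:
        pool.put(conn)

# With normal latency all requests succeed; once the database round trip
# exceeds the acquire timeout, most concurrent requests see 502s.
for latency in (0.01, 2.0):
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(handle_request(latency)))
        for _ in range(12)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"db latency={latency}s -> {sorted(results)}")
```

This also hints at why the corrective actions focus on isolation: if clients can keep existing connections alive when the backend is unreachable, a saturated API layer degrades new requests without dropping established sessions.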