Twingate Service Incident - Aug 19, 2023

Severity: Minor | Status: Resolved

Twingate experienced a minor incident on August 19, 2023, lasting 20m and affecting five components: Authentication - Enterprise, Public API, Admin Console, Authorization, and Connector Heartbeat. The incident has been resolved; the full update timeline is below.

Started: Aug 19, 2023, 08:01 AM UTC
Resolved: Aug 19, 2023, 08:21 AM UTC
Duration: 20m
Detected by Pingoru: Aug 19, 2023, 08:01 AM UTC

Affected components

Authentication - Enterprise, Public API, Admin Console, Authorization, Connector Heartbeat

Update timeline

  1. investigating Aug 19, 2023, 08:01 AM UTC

    We are seeing issues with Twingate service and investigating.

  2. investigating Aug 19, 2023, 08:01 AM UTC

    We are continuing to investigate this issue.

  3. resolved Aug 19, 2023, 08:21 AM UTC

    This incident has been resolved. We'll publish an RCA as soon as we can.

  4. postmortem Aug 23, 2023, 06:11 AM UTC

    **Summary**

    On August 19 at 7:51 AM UTC, Twingate received alerts of issues with the login services. Within a few minutes, the Twingate engineering team began investigating. The team quickly identified that our backend was seeing excessive timeouts from a 3rd-party API, preventing it from being able to process other requests such as authentication. After some initial fixes were unsuccessful, Twingate contacted the 3rd party and also disabled support for real-time updates that make use of these specific 3rd-party API calls. As a result, the issues started resolving at 8:10 AM UTC. Most of the services recovered quickly and full resolution occurred at 8:15 AM UTC. The vendor later confirmed and fixed the issue, and Twingate re-enabled the real-time update feature shortly after on the same day, August 19.

    **Root cause**

    The Twingate backend was exhausted due to timeouts from a 3rd-party API.

    **Post-incident Analysis**

    Twingate had already separated out most services to their own deployments, allowing those services to function throughout the incident. Therefore, only some users that needed to authenticate or re-authenticate were affected; any user that had authenticated prior to the incident was not impacted. Analysis of logs post-incident showed that the incident started at 7:49 AM UTC and fully recovered at 8:15 AM UTC.

    **Corrective actions**

    Short Term:

    * Separate Authentication and real-time services to their own deployments - COMPLETED

    Medium / Long Term:

    * Reevaluate and optimize timeout values for various backend and 3rd party services
    * Simplify the internal Twingate process for enabling and disabling features
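
The failure mode and mitigation described in the postmortem (a slow 3rd-party API consuming backend capacity until the dependent feature was switched off) can be illustrated with a small sketch. This is not Twingate's implementation: the class `RealtimeUpdater`, the helper `always_times_out`, and the threshold parameter are hypothetical names chosen for illustration. Note also that the postmortem describes disabling the feature manually, whereas this sketch assumes an automatic kill switch that trips after a run of consecutive timeouts.

```python
class RealtimeUpdater:
    """Hypothetical wrapper around a 3rd-party call with an automatic kill switch.

    Assumption: `send` raises the built-in TimeoutError when the 3rd-party
    API does not answer in time. Real clients may raise a different exception.
    """

    def __init__(self, send, max_consecutive_timeouts=3):
        self.send = send
        self.max_consecutive_timeouts = max_consecutive_timeouts
        self.consecutive_timeouts = 0
        self.enabled = True  # feature flag for real-time updates

    def push_update(self, payload):
        # While the flag is off, skip the 3rd-party call entirely so that
        # no worker is tied up waiting on a dependency known to be slow.
        if not self.enabled:
            return "skipped"
        try:
            self.send(payload)
        except TimeoutError:
            self.consecutive_timeouts += 1
            # Trip the kill switch once the failure streak reaches the limit.
            if self.consecutive_timeouts >= self.max_consecutive_timeouts:
                self.enabled = False
            return "timeout"
        # Any success resets the failure streak.
        self.consecutive_timeouts = 0
        return "sent"

    def re_enable(self):
        # Manual re-enable, e.g. after the vendor confirms a fix.
        self.consecutive_timeouts = 0
        self.enabled = True


def always_times_out(payload):
    """Stand-in for a degraded 3rd-party API (simulated, no network I/O)."""
    raise TimeoutError("simulated 3rd-party timeout")
```

With the degraded stand-in, the first three calls return `"timeout"` and later calls return `"skipped"` until `re_enable()` is called. In a real service the threshold and the per-call timeout would need tuning against observed latency, which is the kind of review the medium-term corrective action points at.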