The Things Industries Outage History

The Things Industries is up right now

There have been 2 The Things Industries outages since March 3, 2026, totaling 39h 3m of downtime. Each is summarised below with incident details, duration, and resolution information.

Source: https://status.thethings.industries

Major · March 23, 2026

Gateways disconnected from The Things Stack Cloud in the nam1 cluster

Detected by Pingoru: Mar 23, 2026, 10:44 AM UTC
Resolved: Mar 23, 2026, 11:40 AM UTC
Duration: 55m
Affected: North America 1 (nam1)
Timeline · 4 updates
  1. identified Mar 23, 2026, 10:44 AM UTC

    An update intended for other regions was inadvertently applied to the nam1 region outside of its scheduled maintenance window. As a result, some gateways in nam1 are experiencing connectivity issues. We sincerely apologize for the disruption. Our team is actively investigating and closely monitoring the situation. We will provide further updates as the investigation progresses.

  2. monitoring Mar 23, 2026, 11:10 AM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Mar 23, 2026, 11:40 AM UTC

    This incident has been resolved.

  4. postmortem Mar 23, 2026, 12:53 PM UTC

    ## Summary

    On March 23, 2026, during a scheduled maintenance window, the Gateway Server (GS) component in the NAM1 region was accidentally restarted, causing gateways to disconnect in a similar pattern to the March 3 incident. The development team took the opportunity to roll out a fix that had been planned for the next maintenance window. The fix resolved the reconnection issue, and gateways are now reconnecting within several minutes.

    ## Impact

    Some gateways got disconnected for some of the tenants in the NAM1 region following an accidental Gateway Server restart during a maintenance window.

    ## Root Cause

    The incident was triggered by an accidental restart of the Gateway Server component in NAM1 during a scheduled maintenance window, causing gateways to disconnect in the same pattern observed during the March 3 incident, where simultaneous reconnects under high server load led to premature connection drops due to insufficient timeout and cache configurations.

    ## Resolution

    The development team used the opportunity to roll out a fix ahead of its planned release date. The deployed fix resolved the reconnection bottleneck, and affected gateways are now reconnecting within several minutes. No customer action is required.

    ## Prevention / Action items

    ### Process improvements

    Procedures around component restarts during maintenance windows will be reviewed to prevent accidental restarts of production-critical components such as the Gateway Server.

    ### Infrastructure improvements (already applied)

    The fix rolled out during this incident addresses the reconnection performance issues identified in the March 3 post-mortem. Gateway Server instances in NAM1 are now able to handle mass reconnect scenarios significantly more efficiently, with reconnection times reduced to within several minutes.
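
If you run gateways on an affected cluster, you can spot-check whether they have reconnected through The Things Stack's HTTP API. The sketch below is a minimal example, assuming the Gateway Server connection-stats endpoint (`/api/v3/gs/gateways/{gateway-id}/connection/stats`), the nam1 Cloud host named in this incident, a `TTS_API_KEY` environment variable holding an API key with gateway read rights, and hypothetical gateway IDs; adjust all of these for your own deployment.

```python
"""Spot-check gateway reconnection after an incident.

A minimal sketch against The Things Stack HTTP API. The host, the
TTS_API_KEY environment variable, and the gateway IDs below are
assumptions; replace them with values from your own deployment.
"""
import json
import os
import urllib.request
from urllib.error import HTTPError

HOST = "https://nam1.cloud.thethings.industries"  # cluster affected in this incident
API_KEY = os.environ["TTS_API_KEY"]               # API key with gateway read rights
GATEWAY_IDS = ["my-gateway-1", "my-gateway-2"]    # hypothetical gateway IDs


def connection_stats(gateway_id: str) -> dict:
    """Fetch Gateway Server connection stats for one gateway."""
    req = urllib.request.Request(
        f"{HOST}/api/v3/gs/gateways/{gateway_id}/connection/stats",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


for gw in GATEWAY_IDS:
    try:
        stats = connection_stats(gw)
        # connected_at is reported while the gateway holds a live connection.
        print(f"{gw}: connected since {stats.get('connected_at', 'unknown')}")
    except HTTPError as err:
        # The API returns an error (e.g. 404) when there is no live connection;
        # treat the exact status code as an assumption.
        print(f"{gw}: no live connection (HTTP {err.code})")
```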


Major · March 3, 2026

Gateway Connectivity Issues

Detected by Pingoru: Mar 03, 2026, 08:08 PM UTC
Resolved: Mar 05, 2026, 10:16 AM UTC
Duration: 1d 14h
Affected: North America 1 (nam1)
Timeline · 4 updates
  1. investigating Mar 03, 2026, 08:08 PM UTC

    We are observing gateway disconnects for some tenants in NAM1; this appears to have been triggered by an AWS update procedure that affected the Gateway Server component. We are currently investigating the issue.

  2. monitoring Mar 04, 2026, 02:28 AM UTC

    Affected gateways have reconnected and service has recovered without intervention. We are monitoring for any recurrence and continuing to investigate the root cause.

  3. resolved Mar 05, 2026, 10:16 AM UTC

    This incident has been resolved.

  4. postmortem Mar 05, 2026, 10:17 AM UTC

    ## Summary

    On March 3, 2026, it was reported that some gateways had lost their connection to the LNS for some of the tenants in the NAM1 region. This was triggered by an AWS update procedure that affected the Gateway Server component. Although the affected gateways eventually reconnected, recovery took longer than expected (8 hours for some tenants).

    ## Impact

    * Some gateways got disconnected for some of the tenants in the NAM1 region.
    * Three Gateway Server instances restarted during the incident, disconnecting a large number of gateways, which failed to immediately reconnect to the remaining active instances.
    * The number of connected gateways kept declining gradually.
    * Eventually, the affected gateways reconnected and service recovered without intervention.

    ## Root Cause

    The incident was triggered by an AWS infrastructure event (task retirement), which caused several Gateway Server instances in NAM1 to undergo a rolling restart. As instances restarted one by one, gateways began disconnecting gradually. Since the restart was rolling rather than simultaneous, some gateways maintained their connection to instances that remained active throughout the event.

    The root cause of the prolonged recovery, however, was a short connection timeout configured on some gateways. With a large number of gateways attempting to reconnect simultaneously, the Gateway Server was operating under unusually high load, and the short timeout was insufficient under these conditions, causing connections to close prematurely before they could be fully established. This cycle repeated until the restarted instances completed their post-restart operations, at which point server load normalised and Gateway Server caches became available, significantly speeding up the connection process for the remaining disconnected gateways until service was fully restored.

    In short: the AWS infrastructure event triggered the gateway disconnects, but the timeout misconfiguration is what made recovery take up to 8 hours.

    ## Resolution

    There was no manual intervention to resolve this incident. The affected gateways reconnected automatically after the downtime.

    ## Prevention / Long-term improvements

    ### Proactive outreach to affected tenant owners regarding gateway connection timeout

    A minimum 60-second timeout is necessary for reliable connection establishment under high server load. We will be reaching out to affected tenant owners, recommending that the `TC_TIMEOUT` setting in their Basic Station configuration be set to at least the default value of `60s`. This change will help prevent premature connection drops during periods of elevated reconnect activity.

    ### Documentation improvements

    Existing documentation will be improved to specifically address recommendations for a longer `TC_TIMEOUT` setting in the Basic Station configuration.

    ### Infrastructure improvements

    Our Cloud infrastructure configuration will be improved to reduce and accommodate higher instance load post-restart.
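
To act on the `TC_TIMEOUT` recommendation above, you can verify what a gateway's Basic Station configuration actually sets. The sketch below is a starting point rather than a definitive check: the `station.conf` path, the `station_conf` section name, and the duration format are assumptions, since Basic Station packaging varies by gateway vendor.

```python
#!/usr/bin/env python3
"""Check that a Basic Station station.conf sets TC_TIMEOUT to at least 60s.

The March 3 postmortem recommends TC_TIMEOUT >= 60s. The file path and the
"station_conf" section name below are assumptions; consult your gateway
vendor's Basic Station packaging for the actual location of the setting.
"""
import json
import re
import sys

CONF_PATH = "/etc/station/station.conf"  # assumed path; varies by vendor
MIN_SECONDS = 60                          # minimum recommended by the postmortem


def parse_duration(value: str) -> float:
    """Parse a duration string such as "60s" or "2m" (assumed format)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)?", value.strip())
    if not match:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = float(match.group(1)), match.group(2) or "s"
    return number * {"ms": 0.001, "s": 1, "m": 60, "h": 3600}[unit]


def main() -> int:
    with open(CONF_PATH) as f:
        conf = json.load(f)
    # Look in the station_conf section first; some packages place the
    # setting at the top level instead.
    section = conf.get("station_conf", conf)
    timeout = section.get("TC_TIMEOUT")
    if timeout is None:
        print("TC_TIMEOUT not set; the default of 60s applies.")
        return 0
    if parse_duration(str(timeout)) < MIN_SECONDS:
        print(f"TC_TIMEOUT={timeout} is below the recommended 60s minimum.")
        return 1
    print(f"TC_TIMEOUT={timeout} meets the recommended 60s minimum.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run it wherever the gateway's `station.conf` lives, and raise `TC_TIMEOUT` to at least `60s` if it flags a lower value.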


Looking to track The Things Industries downtime and outages?

Pingoru polls The Things Industries' status page every 5 minutes and alerts you the moment it reports an issue, before your customers do.

  • Real-time alerts when The Things Industries reports an incident
  • Email, Slack, Discord, Microsoft Teams, and webhook notifications
  • Track The Things Industries alongside 5,000+ providers in one dashboard
  • Component-level filtering
  • Notification groups + maintenance calendar
Start monitoring The Things Industries for free

5 free monitors · No credit card required