The Things Industries incident
Intermittent Console and API issues on The Things Stack Cloud
The Things Industries experienced a notice-level incident on February 17, 2024 affecting Europe 1 (eu1) and North America 1 (nam1), lasting 2h 25m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Feb 17, 2024, 11:02 AM UTC
We are investigating console access and API connection issues on The Things Stack Cloud. We will provide more details as we progress.
- identified Feb 17, 2024, 12:26 PM UTC
We have identified the issue and are deploying a fix to resolve it.
- monitoring Feb 17, 2024, 01:08 PM UTC
A fix has been deployed and we are monitoring the results.
- resolved Feb 17, 2024, 01:27 PM UTC
This issue is now resolved.
- postmortem Feb 19, 2024, 09:43 AM UTC
We faced a minor operational issue with The Things Stack Cloud affecting external API availability in all of our clusters, mainly impacting Console usage. Traffic processing and delivery were not affected.

### Cause

The root cause of this issue is that the service we use for [load balancing](https://en.wikipedia.org/wiki/Load_balancing_(computing)) and request routing, [Envoy](https://www.envoyproxy.io/), had a bug \([1](https://github.com/envoyproxy/envoy/issues/32401), [2](https://github.com/envoyproxy/envoy/issues/32371)\) in its HTTP/2 request processing library.

We had upgraded to what was at the time the latest release of Envoy, `v1.29.0`, as part of our `v3.29.0` release, and initially did not experience any elevated timeout or error rates in our load balancer. However, over the past two days we received more reports of timeouts, and we decided to roll back our Envoy version upgrade. We have not observed any elevated timeout rates since.

We monitor failed request rates _inside_ a cluster, but not at the _edge_ of the cluster, where the load balancer operates, as we deem these rates to be more accurate near the components which experience failures. We will be looking into possible improvements to our monitoring in order to account for such issues in the future.

### Resolution

We have rolled back to the last known working Envoy version, `v1.28.1`, and have no longer been able to reproduce the sporadic timeouts.

---

Adrian-Ștefan Mareș
Head of Engineering, The Things Industries