Ably incident
Transient netsplit in us-east-2 leading to protracted netsplit for a few channels
Ably experienced a notice incident on August 12, 2024 affecting US East (Ohio). The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- resolved Aug 12, 2024, 02:08 PM UTC
At 03:32 UTC on the 12th, the us-east-2 (Ohio) data center was partially partitioned (netsplit) from the rest of the cluster. The netsplit had healed by 03:40. However, a very small proportion of channels did not self-heal and remained partitioned between us-east-2 and at least one other site in the cluster (mostly eu-central-1 or eu-west-1) until we took manual corrective action, which was completed in the production cluster at 12:22 UTC. Our existing quality-of-service alerting infrastructure should have led to us being paged at the time; however, due to a bug in the definition of this particular alert, this did not happen. Reliable delivery of messages is the thing we prioritise above all else at Ably, so even though the number of affected channels was small, we are taking this extremely seriously. In addition to fixing the bug in the self-healing function, we will be prioritising improvements to our alerting framework to ensure that no similar issue can last this long in future. We are deeply sorry to those customers who had affected channels.