Ably experienced a major incident on October 29, 2025 affecting Integrations, lasting 44m. The incident has been resolved; the full update timeline is below.
Affected components
- Integrations
Update timeline
- investigating Oct 29, 2025, 08:57 AM UTC
We are experiencing an elevated level of errors impacting Ably Reactor Webhook delivery. We have identified the issue and are currently deploying a fix.
- identified Oct 29, 2025, 08:58 AM UTC
The issue has been identified and a fix is being implemented.
- identified Oct 29, 2025, 09:21 AM UTC
We are continuing to work on a fix for this issue.
- identified Oct 29, 2025, 09:26 AM UTC
A fix has been deployed and Ably Reactor Webhook delivery service is recovering.
- resolved Oct 29, 2025, 09:42 AM UTC
This incident has been resolved.
- postmortem Nov 07, 2025, 09:16 AM UTC
# Summary of incident affecting the Ably production service on 28-29 October - investigation and conclusions

## Overview

There was an incident affecting Ably's service between 28 and 29 October 2025, which affected the delivery of webhooks. The impact of the incident lasted from 1134 on 28 October to 0936 on 29 October (all times UTC).

The incident occurred as a result of the deployment of new webhook functionality that was incompatible with many already-configured webhook rules. We now have a good understanding of the underlying problem, of the various factors that led to it having customer impact, and of the effectiveness of our response in this case.

## Background

Ably operates a number of production clusters for its service around the globe. There is a main production cluster that services the majority of customer accounts, plus a small number of dedicated clusters for specific accounts. Each of the clusters has a presence in multiple regions in a globally federated system.

### Ably integration rules

Ably processes messages that are published on channels and directs them to realtime subscribers on those channels, as well as to other targets as configured by integration rules. Multiple targets are supported for egress integration rules, including message queues (Kafka, AMQP) and generic HTTP endpoints (webhooks). Rules are configured via a control plane which supports:

* direct interactive ("click-ops") configuration in the Ably dashboard;
* a [Control API](https://ably.com/docs/api/control-api);
* a [Terraform Provider](https://ably.com/integrations/ably-terraform-provider) which acts through the Control API.

The source of truth for integration rule definitions is Ably's core "realtime" platform, and the other services (the website and control API service) access it via an internal provisioning API.

Integration rule definitions comprise some attributes that are generic (such as `channelFilter`, which specifies which channels the rule applies to), and some that are specific to each type of integration rule target (such as the target endpoint, credentials, and any protocol-specific attributes). Some generic attributes - i.e. attributes that are capable of having a meaning for multiple rule types - are not universally supported, since the evolution of rule features is based on adoption and priority. As a result, the schema for integration rule definitions contains multiple optional generic attributes as well as a wide range of type-specific attributes.

One generic feature is an option that enables a custom set of trust anchors (TLS certificates) to be configured for a rule; this is to support integrations that target TLS endpoints that have a self-signed certificate rather than certificates issued by a public Certificate Authority (CA). This feature is used by customers with self-hosted endpoints, either as part of development/CI infrastructure, or as internally-facing production services where the operational effort of maintaining a certificate from a public CA is not justified. At the time of the incident custom trust anchors were supported for Apache Pulsar and Kafka rules, and support had just been added for webhooks.
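To make the schema discussion concrete, the sketch below shows roughly how such a rule definition might be modelled. The type and field names are illustrative only; they are not Ably's actual schema.

```go
// Package rules: an illustrative sketch only; the type and field names
// here are hypothetical, not Ably's actual schema.
package rules

// RuleDefinition mixes generic attributes with type-specific ones.
type RuleDefinition struct {
	// Generic attribute: which channels the rule applies to.
	ChannelFilter string `json:"channelFilter"`

	// Generic but optional attribute: custom trust anchors as PEM-encoded
	// certificates. A nil slice means "not configured" (fall back to the
	// system trust store); an empty, non-nil slice is a distinct value,
	// and that distinction is central to this incident.
	CustomTrustAnchors []string `json:"customTrustAnchors,omitempty"`

	// Type-specific attributes, e.g. for a webhook (HTTP) target.
	TargetURL string            `json:"targetUrl,omitempty"`
	Headers   map[string]string `json:"headers,omitempty"`
}
```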
### Ably's service deployment process

Ably adopts a staggered deployment approach whereby each change is deployed gradually, minimising the adverse impact of any regression in a new software version that has escaped all prior test phases.

Changes are typically deployed on a weekly basis; usually in the main (multi-tenanted) cluster this means that changes are deployed to the ap-southeast-2 (Sydney) region first, and error rates monitored, before deploying to the remaining regions starting the following day. However, if a deploy needs to be expedited for any reason, subsequent regions can be deployed after a change has been in ap-southeast-2 for only a few hours.

Most monitoring is automated; metrics are gathered for a wide range of service health indicators, including the rates of logged error events and the rate of error API responses. When error metrics exceed a threshold, on-call engineers are paged to investigate.

Since the majority of Ably's infrastructure comprises immutable EC2 instances, any deploy consists of deploying new instances with the new software version in parallel with retirement of superseded instances. For a normal deploy this process happens over the course of several hours, so that the number of instances being retired at any given time is small, minimising any impact of the update on existing traffic. If a deploy needs to be accelerated, such as when making an emergency rollback, deployment would more typically involve deploying a full complement of replacement instances immediately, followed by termination of old instances at a rate that reflects the urgency of the operation. Immediate and abrupt termination of multiple instances simultaneously carries a greater risk of loss of ephemeral channel state, so even in accelerated rollbacks the preference is for old instances to be terminated gradually, giving an orderly shedding of load that avoids any data loss.

## As it happened

A change to add support for custom trust anchors was implemented, and was part of a set of changes deployed in a staggered deployment starting on 28 October. This began in ap-southeast-2 from 1134 and, since the change set also included a fix for an unrelated problem, it was expedited and started in the other regions from around 1500.

Starting at 1940 on 28 October, customer reports of a problem with webhook delivery began to arrive: two independent reports on 28 October and a handful more overnight. At 0731 on 29 October Ably's customer support team escalated the problem reports to Ably's engineering team for investigation.

At 0745 an incident was declared, on-call engineers were paged and investigation started. The problem was determined to be a regression introduced in the previous day's deploy. At 0746 a rollback process was initiated to revert the changes, first in ap-southeast-2. At 0811 rollback was initiated for all regions and dedicated clusters that had received the changes. The rollback completed at 0936 and errors dropped to zero.

## Investigation

The investigation aimed to understand:

* the cause of failed webhook invocations;
* how Ably's testing had failed to detect the regression;
* how Ably's monitoring had failed to detect the regression;
* the effectiveness of the response, including the time taken to respond following the first customer problem reports.

The conclusions of the investigation are summarised below.

### The cause of failed webhook invocations

The new code that caused the problem added support for custom trust anchors for webhooks.
This code would look in the rule configuration for a configured set of trust anchors and, if present, use those instead of the system set of trust anchors for public CAs. The problem was caused by the fact that a significant number of preexisting rule definitions already had an empty set of trust anchors configured. The new code then used that empty set instead of the default set, and so failed to verify the certificate chain for any TLS endpoint. These failures were logged internally with a severity that classifies the event as a client configuration error rather than a system error.

The presence of the empty trust anchor set in existing rule definitions was due to a long-standing behaviour in the web (dashboard and control API) services that create and update rules via the realtime platform's provisioning API. When custom trust anchor support was first added for Pulsar rules (in November 2021), the set of custom trust anchors became a generic rule attribute; but the web service would configure an empty set, instead of a true zero value, when custom trust anchors were unset. This meant that any rules created since November 2021, even for types that did not initially support custom trust anchors, would have an empty set configured. The sketch below illustrates the failure mode, together with the interim fix described in the remedial actions that follow it.
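A minimal Go sketch, assuming webhook delivery builds a TLS client configuration from the rule definition along these lines (the function names are hypothetical):

```go
package webhook

import (
	"crypto/tls"
	"crypto/x509"
)

// buildTLSConfig sketches the buggy behaviour: it asks "is the attribute
// present?" rather than "does it contain any certificates?".
func buildTLSConfig(anchors []string) *tls.Config {
	if anchors != nil {
		pool := x509.NewCertPool()
		for _, pemCert := range anchors {
			pool.AppendCertsFromPEM([]byte(pemCert))
		}
		// For a rule configured with an empty (but non-nil) set, the pool
		// contains no certificates, so no chain can ever be verified and
		// every TLS webhook endpoint fails the handshake.
		return &tls.Config{RootCAs: pool}
	}
	// Leaving RootCAs nil means "use the host's default trust store".
	return &tls.Config{}
}

// buildTLSConfigFixed sketches the interim fix: an empty set is treated
// the same as an unset one, falling back to the default trust anchors.
func buildTLSConfigFixed(anchors []string) *tls.Config {
	if len(anchors) == 0 {
		return &tls.Config{}
	}
	pool := x509.NewCertPool()
	for _, pemCert := range anchors {
		pool.AppendCertsFromPEM([]byte(pemCert))
	}
	return &tls.Config{RootCAs: pool}
}
```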
Remedial actions identified in this area are as follows.

**Fix the immediate regression.** As an interim behaviour, interpret an empty custom trust anchor array as if custom trust anchors are unset (and so use the default trust anchors).

**Correct web service behaviour.** Update the web service so that a true zero value is set, instead of an empty array, when custom anchors are not set.

**Correct the configuration of existing rules.** Perform a database update for all existing rule definitions that have an empty array configured.

### The failure of tests to detect the regression

The relevant tests are a combination of unit tests and integration tests within the test suite of the realtime platform. Unit test coverage includes all webhook configuration options (and new tests were included with the deployed change), plus tests for different endpoint behaviours (HTTP/1.1 vs HTTP/2, for example). Unit tests mock different parts of the surrounding code and system in order to be as targeted as possible. Integration tests include full end-to-end tests that create integration rules via the provisioning API, pointing to a range of endpoints that exist as part of the CI system.

The reason that none of these tests identified the problem is that none used the configuration in the same way as the web service; in particular, any rule that did not have custom trust anchors would be configured with a null entry. There was no expectation in the tests that an empty set should be interpreted as having no custom configuration. There was no integration test coverage of the end-to-end behaviour when rules are created via the web service.

Remedial actions identified in this area are as follows.

**Improve web service testing of integrations.** This should at least include:

* web service unit tests that verify that an unset custom trust anchor configuration results in a zero value being configured (a sketch of such a test follows this list);
* web service integration tests that verify end-to-end behaviour involving creation of rules via the web service, with and without custom trust anchors set.
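A sketch of the first kind of test, reusing the hypothetical `RuleDefinition` from earlier; `webServiceCreateRule` is a stand-in for whatever helper drives rule creation through the web service path:

```go
package rules

import "testing"

// webServiceCreateRule is a hypothetical helper that creates a rule via
// the web service and returns the definition as stored by the realtime
// platform's provisioning API.
func webServiceCreateRule(t *testing.T, def RuleDefinition) RuleDefinition {
	t.Helper()
	// ... create the rule through the web service, fetch the stored
	// definition, and return it.
	panic("sketch only")
}

// TestUnsetTrustAnchorsStoredAsZeroValue asserts that a rule created
// without custom trust anchors is stored with a true zero value (nil),
// not an empty array, so downstream code falls back to the defaults.
func TestUnsetTrustAnchorsStoredAsZeroValue(t *testing.T) {
	rule := webServiceCreateRule(t, RuleDefinition{
		ChannelFilter: "orders:*",
		TargetURL:     "https://example.com/hook",
		// CustomTrustAnchors deliberately left unset.
	})

	if rule.CustomTrustAnchors != nil {
		t.Fatalf("expected customTrustAnchors to be unset (nil), got %#v",
			rule.CustomTrustAnchors)
	}
}
```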
### The failure of monitoring to detect the regression

Ably monitors a wide range of service metrics and alerts the relevant engineering teams when those metrics indicate a service problem. Monitored metrics include errors logged internally whilst servicing customer traffic, the rate of error responses to customer requests, and the rate of errors or failures in traffic that is generated by Ably specifically for service monitoring. For this last category, Ably operates a series of "quality of service", or QoS, traffic generators that exercise a range of service functionality, so that certain classes of deployment-related failure can be identified that would not be identifiable via CI.

Errors - both those logged internally and customer-visible error responses - can be classified broadly into:

* bad requests - that is, operations that fail because the arguments or parameters of the request are invalid. When these errors are associated with an error code, it is typically in the 4xx range; these include general bad requests, plus authentication and authorisation failures;
* internal errors - that is, errors that arise performing some operation that ought to succeed, but fails because of an issue (e.g. an unexpected exception, a failure to communicate over the network, or an assertion failure). When these errors are associated with an error code, it is generally in the 5xx range.

A high rate of internal errors - either in our internal logs, or in customer-visible responses - is an indication of a problem with the service. A high rate of bad requests is more typically a problem on the customer side: an application is attempting operations that cannot succeed.

All 5xx (internal error) metrics are monitored and alerts are generated when they exceed a threshold indicating a problem. Many 4xx (bad request) metrics are monitored, but in most situations these do not generate alerts. This is for several reasons:

* most are not an indication of a system problem; there is simply some invalid operation being attempted by a client;
* all public endpoints are subject to a very high rate of spurious requests that fail authentication and authorisation checks, so the general rate of 4xx errors is high.

Similarly, problems with invocation of a webhook (or other rule target) are usually a result of misconfiguration of the rule, or unavailability of the target endpoint, and these are classified with a 4xx status code. There is a material background level of such errors. Given the very high rate of background 4xx errors on external endpoints, it is generally difficult to identify any pattern on those endpoints that is indicative of a system problem.

Notwithstanding this, all Ably Enterprise accounts have alerts set up for both 4xx and 5xx error metrics. 4xx alerts do not generally page an engineer out-of-hours, but both engineering and customer success teams are notified when there is an unexpectedly elevated rate of 4xx errors for an Enterprise customer. The configuration of those accounts within Ably's monitoring and alerting systems is added when the customer is onboarded as an enterprise customer.

In this incident the overall increase in rule invocation failures was not significant in comparison with the normal ambient background error rate, so, based on those metrics, there was no clear signal of a problem, and the increase was not sufficient to cross the alerting threshold on any existing enterprise accounts. Many customers with the highest usage volumes had rules created before November 2021, which were unaffected by the problem. At the time of the incident, QoS monitoring simulated traffic for some integrations but not webhooks.

Remedial actions identified in this area are as follows.

**Improve the ability of metrics and alerts for integration rules to discriminate between Ably problems and end-user problems.** Although most errors are a result of configuration problems, or problems in end-user infrastructure, it is still clearly possible for Ably regressions to also generate errors. We need more intelligent metrics and alerts that identify such occasions without creating an unsustainable number of false positives. One possibility is to create alerts based on individual metrics labelled by rule type, to ensure that an increase in error rate for an individual rule type is alerted on, even if that increase is not a material fraction of errors across all rule types; a sketch of this approach is shown below. Another possibility is to correlate error levels with deploy events, or with software versions, so that a variation in error rate correlated with a code change is more likely to trigger an alert.
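A minimal sketch of per-rule-type error metrics, using Prometheus client conventions purely for illustration; Ably's actual metrics pipeline is not described here, and the metric and function names are hypothetical:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric: failed rule invocations, labelled by rule type and
// status class, so that alerting can key on a single rule type's error
// rate rather than the aggregate across all rule types.
var ruleInvocationFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "integration_rule_invocation_failures_total",
		Help: "Failed integration rule invocations by rule type and status class.",
	},
	[]string{"rule_type", "status_class"},
)

func init() {
	prometheus.MustRegister(ruleInvocationFailures)
}

// RecordFailure classifies a failure and increments the labelled counter.
// An alert on the per-rule_type failure rate could then fire on a spike
// in (say) "http" failures even if overall 4xx volume barely moves.
func RecordFailure(ruleType string, statusCode int) {
	class := "4xx"
	if statusCode >= 500 {
		class = "5xx"
	}
	ruleInvocationFailures.WithLabelValues(ruleType, class).Inc()
}
```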
**Add webhook tests to QoS monitoring.** Continuous testing of simulated traffic in a production setting should be extended to include webhooks.

### The effectiveness of the response

From the first customer impact of this issue to the final resolution was 22 hours. This is far too long. Ably's response in this incident was delayed at multiple stages:

* from the problematic code being deployed to the first reported issue was around 8 hours;
* from the first report to raising an incident was around 12 hours;
* the full rollback, from the time that an incident was declared, took nearly 2 hours.

Of course, had there been suitable alerting for the production problem we would have reacted sooner; but this timeline reflects the fact that, even with the information we did have, the response was slow. This was due to a combination of factors:

* the reports were all received outside of business hours and, since the customers raising them were not enterprise customers, there was no out-of-hours process to review them;
* there was some delay before the support team correlated multiple reports and concluded that there was a problem with the delivery of webhooks;
* the rollback process was comparatively slow, deliberately avoiding a more disruptive abrupt rollback since the issue affected a relatively small number of customers.

The remedial actions in this area are as follows.

**Review processes for handling and escalating issue reports out-of-hours.** Customers on lower support tiers can legitimately expect slower response times for any individual issues or questions they have, but at present, when an Ably problem arises that affects them, there is no means for the relevant response team to be made aware. We will review our processes with the aim of improving support in these situations.

## Conclusion

The service issues were the direct result of a regression in the Ably service - specifically, an incompatibility between a new feature (custom trust anchors for webhooks) and the existing state of rule configurations.

The cause of the extended duration of the incident was that the classification and rate of failed delivery events resulted in no alert being generated, which meant that Ably was not aware of the ongoing problem until it was reported by customers. Although it is difficult to monitor 4xx error rates effectively, because of the high background level of such errors, there is clearly a need to improve on the current situation, in which Ably regressions can go undetected. There was significant additional delay in triggering a response because the reports were received on a support channel that was not monitored out of business hours. Once alerted to the problem, Ably initiated the response quickly, but the operational nature of the rollback meant that it took nearly two hours to complete.

We take service continuity very seriously, and this incident represents another opportunity to learn and improve the level of service we are committed to providing for our customers. We are sorry for the disruption that was experienced in this instance, and are committed to identifying and implementing remedial actions so that we eliminate the risk of similar failures in the future.