InfluxData incident
Degraded read performance in Azure US-East 1
InfluxData experienced a critical incident on November 11, 2025 affecting Web UI and API Queries, lasting about one day. The incident has been resolved; the full update timeline is below.
Affected components
Web UI, API Queries
Update timeline
- investigating Nov 11, 2025, 02:46 AM UTC
Investigating a fix.
- identified Nov 11, 2025, 03:20 AM UTC
A fix has been implemented. We are monitoring whether read performance has improved.
- identified Nov 11, 2025, 05:49 AM UTC
We are still seeing stability issues. Investigating other approaches.
- identified Nov 11, 2025, 06:15 AM UTC
We are still working on the resolution. Access to the InfluxDB Cloud UI and API will result in internal errors.
- identified Nov 11, 2025, 10:35 AM UTC
We are continuing to work on a fix for this issue.
- identified Nov 11, 2025, 02:36 PM UTC
We are still working on the resolution.
- identified Nov 11, 2025, 02:45 PM UTC
We are continuing to work on a fix for this issue.
- identified Nov 11, 2025, 03:13 PM UTC
We have identified the issue and are working on applying a fix.
- identified Nov 11, 2025, 04:03 PM UTC
We have identified the issue and are continuing to work on a fix.
- identified Nov 11, 2025, 05:19 PM UTC
We have identified the issue and are continuing to work on a fix.
- identified Nov 11, 2025, 07:02 PM UTC
We have implemented a fix that is resulting in steady system recovery.
- identified Nov 11, 2025, 07:25 PM UTC
Recovery is continuing to progress.
- identified Nov 11, 2025, 09:01 PM UTC
Recovery is continuing, with setbacks.
- identified Nov 11, 2025, 10:56 PM UTC
We apologize for the length of time that it is taking to bring this cluster back into service. The team is working diligently to restore normal operation.
- monitoring Nov 12, 2025, 01:13 AM UTC
We have increased the cluster capacity, and it’s currently recovering. However, due to a large backlog of data, the time to become readable (TTBR) remains elevated for now. Our team continues to monitor this issue closely.
- monitoring Nov 12, 2025, 02:46 AM UTC
The cluster has been scaled up, and system performance has returned to normal. Writes and queries are now processing as expected, and TTBR has stabilized. We sincerely apologize for the disruption to service over the past 24 hours and will share a detailed root cause analysis (RCA) once it is complete.
- resolved Nov 12, 2025, 04:17 AM UTC
This incident has been resolved.
- postmortem Nov 14, 2025, 06:33 PM UTC
RCA for Cloud 2 prod01-us-east-1 outage on Nov 11, 2025

# Summary

The incident began during a planned migration of storage pods to a newer, more performant node class in Azure. The root cause was insufficient capacity in the new Azure node pool. When additional capacity could not be obtained, the team had to spend significant time rebalancing workloads across existing node pools to find a configuration capable of sustaining the write and query load. Clear communication was difficult during the incident because multiple mitigation paths were being explored in parallel, and we did not have confidence in any single fix until late in the recovery. A contributing factor to the duration of the outage was the amount of time Azure requires to apply node-pool adjustments, which slowed our ability to test and iterate on potential solutions. Once sufficient capacity was available and the storage pods were redistributed appropriately, the cluster recovered.

# Background

As part of an ongoing initiative to improve performance across Cloud 2, we are upgrading clusters to newer virtual machine types that were not available when the platform originally launched. These newer node classes provide higher throughput and allow us to right-size workloads more efficiently. This upgrade process has been completed successfully and without downtime in our AWS and GCP environments, as well as in our Azure staging cluster. As with all planned maintenance, an on-call engineering team was monitoring the migration and prepared to intervene if unexpected behavior occurred.

# Incident Details

Cloud 2 clusters consist of 128 storage pods (64 primary and 64 secondary). Normal operation requires all 64 pods to be available (either primary or secondary). Our standard migration sequence moves secondary pods first, monitors stability, and then transitions primary pods once the new node pool has demonstrated adequate capacity.

During the migration, secondary pods were successfully moved to the new Azure node pool and appeared stable under their normal load. However, when we initiated the transition of primary pods, the increased traffic on the secondaries exposed that the new pool did not have enough capacity to sustain the full workload. While the primaries were active, they had been absorbing a significant share of read and write traffic, masking the fact that the secondaries were running close to their limits. As soon as the capacity issue became apparent, engineers began working through multiple mitigation paths in parallel, including alternative node-pool configurations and fallback options.

When we attempted to scale the new node pool to provide the required capacity, Azure reported that no additional nodes of that type were available in the region. This prevented in-place scaling and removed the most direct mitigation path. To restore service, the team had to evaluate multiple fallback configurations using other node pools still available to us. Each adjustment required Azure to perform a full node-pool update cycle, which includes provisioning a temporary pool, shifting workloads, and then applying changes to the original pool. These operations take a significant amount of time in Azure, which slowed our ability to test and validate alternative configurations quickly. The combination of these factors (an under-capacity new node pool, limited regional availability for scaling, and long Azure node-pool update times) extended the overall duration of the outage.
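To make the failure mode concrete, here is a minimal sketch of the missing projection step. It is purely illustrative Python with hypothetical names and numbers, not InfluxData tooling: the secondaries' observed load understates what the new pool must carry once the primaries move, because the primaries are still absorbing most of the traffic.

```python
# Illustrative only: hypothetical names and numbers, not InfluxData's tooling.

def projected_new_pool_load(secondary_load: float, primary_load: float) -> float:
    """Load the new node pool must sustain once the primaries migrate too."""
    return secondary_load + primary_load

def migration_is_safe(secondary_load: float, primary_load: float,
                      new_pool_capacity: float) -> bool:
    """True only if the combined post-migration load fits in the new pool."""
    return projected_new_pool_load(secondary_load, primary_load) <= new_pool_capacity

# Secondaries alone look healthy (40 < 60), but the combined workload does not fit.
secondary_load, primary_load, new_pool_capacity = 40.0, 35.0, 60.0
print(secondary_load <= new_pool_capacity)                                 # True: looks stable
print(migration_is_safe(secondary_load, primary_load, new_pool_capacity)) # False: unsafe to proceed
```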
Once additional capacity using the prior node class was provisioned and storage pods were redistributed across the updated pool layout, the cluster stabilized and normal performance resumed.

# Communication

Communication during the incident did not meet expectations. Although we issued updates on our status page ([status.influxdata.com](http://status.influxdata.com)), they lacked clarity because the team was evaluating several mitigation paths concurrently and did not have confidence that any single approach would resolve the issue. We were also slow to communicate full recovery.

Additionally, an internal miscommunication led to inaccurate information being shared in our community Slack, suggesting the issue was not being addressed due to the U.S. holiday. This was incorrect; engineers were actively working throughout the incident. We recognize the impact this had on customer trust and have identified this as a process issue.

# Future Mitigations

1. Strengthened Capacity and Quota Planning. We will incorporate earlier and stricter quota and capacity validation into the migration process. Future node-pool changes will require confirming that available Azure quota is at least double the projected capacity needed for a safe migration (see the sketch after this list).
2. Revised Migration Procedure. We will update our migration approach to move smaller slices of primaries and secondaries together. This will surface capacity issues earlier and prevent scenarios where a node pool appears healthy until the final migration step.
3. Improved Internal Communication During Incidents. We are refining our internal communication channels to ensure all customer-facing employees receive timely, authoritative information during outages. We are also reviewing expectations for status-page updates to ensure clarity and consistency.
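As a rough illustration of the first mitigation, this is a minimal sketch of the quota gate, with hypothetical function names and numbers rather than our actual process: a node-pool migration proceeds only when available Azure quota is at least double the projected need.

```python
# Illustrative only: hypothetical names and numbers; the point is the 2x headroom rule.

def quota_gate(projected_nodes_needed: int, available_quota_nodes: int,
               headroom_factor: float = 2.0) -> bool:
    """Allow the migration only when available quota is at least
    headroom_factor times the projected node count for the new pool."""
    return available_quota_nodes >= headroom_factor * projected_nodes_needed

# Example: a migration projected to need 24 nodes of the new VM size.
print(quota_gate(24, 32))  # False: quota exists, but without enough headroom
print(quota_gate(24, 48))  # True: safe to proceed
```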