InfluxData incident

Degraded query performance in GCP us-central1

InfluxData experienced a major incident on November 10, 2025, affecting API Queries and Tasks, lasting 2h 48m. The incident has been resolved; the full update timeline is below.

Started
Nov 10, 2025, 09:04 PM UTC
Resolved
Nov 10, 2025, 11:52 PM UTC
Duration
2h 48m
Detected by Pingoru
Nov 10, 2025, 09:04 PM UTC

Affected components

API Queries, Tasks

Update timeline

  1. investigating Nov 10, 2025, 09:04 PM UTC

    We are investigating increased query errors within the region.

  2. investigating Nov 10, 2025, 10:17 PM UTC

    We are continuing to investigate this issue.

  3. monitoring Nov 10, 2025, 10:48 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Nov 10, 2025, 11:52 PM UTC

    This incident has been resolved.

  5. postmortem Nov 25, 2025, 11:19 PM UTC

    RCA for Cloud 2 prod01-us-central-1 query outage on Nov 10, 2025

    # Summary

    The SRE team has been working on a long-term project to review and rebalance storage workloads, providing better separation of workloads within nodepools in order to improve performance, reliability, and ease of maintenance.

    The Cloud 2 service is designed with a set of primary and secondary storage pods. During normal deployments, whenever a primary pod is restarted, the secondary pod takes over so that the service is not interrupted. The change made in this cluster modified the configuration of both the primary and the secondary pods. When the configuration change was applied in prod01-us-central-1, it caused some of the primary storage pods and their corresponding secondary pods to be restarted at the same time. While the pods were unavailable, queries failed. Once the pods restarted, the query error rate returned to normal levels and the cluster recovered.

    # Timeline

    | Time (UTC) | What happened |
    | --- | --- |
    | 8:45pm | The engineer-on-call was paged due to a high rate of query errors in the cluster. Upon investigation, the primary and secondary pods for the same storage slice were found to be unavailable. |
    | 8:50pm | The engineer-on-call identified that the problem was due to a misconfiguration that caused the primary and secondary pods to be restarted at the same time. |
    | 9:00pm | The team determined that the system would recover by itself once the pods restarted with the new configuration, so no intervention was required. |
    | 9:20pm | Queriers started recovering and the error rate returned toward pre-incident levels, with some lingering errors in nginx. |
    | 10:00pm | The query backlog was fully cleared. The cluster had fully recovered. |

    # Future Mitigations

    **1. Reviewing procedures relating to storage pod reconfigurations**

    Going forward, we will ensure that configuration changes for primary pods and secondary pods are never applied at the same time.
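
    The mitigation above amounts to staggering restarts so that a storage slice never loses its primary and its secondary pod at once. Below is a minimal sketch of what such a rollout guard could look like, assuming a paired primary/secondary layout as described in the summary; the slice inventory, health probe, and restart helpers (`STORAGE_SLICES`, `is_healthy`, `restart_with_new_config`) are hypothetical placeholders, not InfluxData's actual tooling.

    ```python
    import time

    # Hypothetical inventory of paired storage pods; the slice and pod names
    # are illustrative, not actual InfluxData Cloud 2 identifiers.
    STORAGE_SLICES = {
        "slice-0": {"primary": "storage-primary-0", "secondary": "storage-secondary-0"},
        "slice-1": {"primary": "storage-primary-1", "secondary": "storage-secondary-1"},
    }

    def is_healthy(pod: str) -> bool:
        # Stub: a real rollout would query the orchestrator's readiness
        # check (e.g. a Kubernetes readiness probe) for this pod.
        return True

    def restart_with_new_config(pod: str) -> None:
        # Stub: a real rollout would apply the new configuration and let
        # the orchestrator recreate the pod.
        print(f"restarting {pod} with new configuration")

    def wait_until_healthy(pod: str, timeout_s: float = 300.0) -> None:
        """Block until the pod reports healthy, or fail loudly."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if is_healthy(pod):
                return
            time.sleep(5)
        raise TimeoutError(f"{pod} did not become healthy within {timeout_s}s")

    def staggered_rollout() -> None:
        """Reconfigure one member of each slice at a time, so the slice
        never loses both its primary and its secondary simultaneously."""
        for pods in STORAGE_SLICES.values():
            # Restart the secondary first and wait for it to recover.
            restart_with_new_config(pods["secondary"])
            wait_until_healthy(pods["secondary"])
            # Only then restart the primary; the healthy secondary keeps
            # serving queries while the primary is down.
            restart_with_new_config(pods["primary"])
            wait_until_healthy(pods["primary"])

    if __name__ == "__main__":
        staggered_rollout()
    ```

    Under this pattern a misapplied change can still break an individual pod, but each slice keeps one serving member at all times, which is exactly the invariant that was violated in this outage.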