MongoDB incident

Metrics ingestion delays and delayed cluster operations

Minor · Resolved

MongoDB experienced a minor incident on October 7, 2025 affecting MongoDB Cloud, lasting 6h 20m. The incident has been resolved; the full update timeline is below.

Started
Oct 07, 2025, 01:42 PM UTC
Resolved
Oct 07, 2025, 08:03 PM UTC
Duration
6h 20m
Detected by Pingoru
Oct 07, 2025, 01:42 PM UTC

Affected components

MongoDB Cloud

Update timeline

  1. investigating Oct 07, 2025, 01:42 PM UTC

    We are currently investigating a delay in our metric ingestion pipeline. Customers may see delays in cluster operations.

  2. investigating Oct 07, 2025, 03:45 PM UTC

    We are continuing to investigate the issue. During this time, Host Down alerts will not fire, and metrics may be missing or delayed for Atlas and Cloud Manager clusters. Cluster operations may also be delayed.

  3. identified Oct 07, 2025, 05:44 PM UTC

    We have identified the cause of the issue and have taken corrective actions. We are monitoring the impact of those mitigations. During this time, Host Down alerts, metrics, and some Atlas cluster operations may still be delayed or missing.

  4. monitoring Oct 07, 2025, 07:32 PM UTC

    A fix has been implemented and we are monitoring the results.

  5. resolved Oct 07, 2025, 08:03 PM UTC

    This incident has been resolved.

  6. postmortem Oct 13, 2025, 03:04 PM UTC

## Executive Summary

**Incident Date/Time**: October 6–7, 2025
**Duration**: Approximately 2 days (partial impact observed intermittently over this period)
**Impact**:

* Delays in metrics ingestion impacting the timeliness of operational data displayed in MongoDB Atlas dashboards.
* Temporary degradation of Atlas cluster management operations due to backend system strain.
* No disruption to the health, availability, or data integrity of customer clusters.

**Root Cause**: Increased load on an internal backing database serving Atlas and Cloud Manager monitoring systems, caused by a combination of unsharded high-traffic collections concentrated on a single shard, inefficient query patterns, and spikes in resource consumption coinciding with a software rollout.
**Status**: Resolved

## What Happened

On October 6 and 7, MongoDB Atlas and Cloud Manager encountered temporary delays in metrics ingestion and backend disruptions affecting certain operational workflows. The primary contributors were elevated resource consumption and localized data distribution problems in an internal database cluster supporting critical monitoring and operational systems.

Initial investigation pointed to high resource usage in one shard of the backing database cluster. Further review revealed systemic inefficiencies, including:

* Unsharded high-traffic collections, leading to uneven data distribution across shards.
* Inefficient query patterns, such as full collection scans, amplifying strain on high-load shards.
* Elevated resource consumption during the rollout of an updated internal software version, increasing system load beyond anticipated thresholds.

Although the functionality and availability of customer clusters remained unaffected, customers experienced degraded monitoring performance and less timely Atlas dashboard updates. MongoDB implemented mitigation measures to stabilize the system and then resolved the long-term root causes to restore operational workflows.
## Impact Assessment

**Affected Services**:

* Metrics ingestion (monitoring performance).
* Atlas cluster management operations (temporary delays).

**Customer Impact**:

* Delayed metrics ingestion within Atlas dashboards limited real-time visibility into operational data.
* Temporary delays in Atlas cluster management operations such as provisioning, resizing, and other backend workflows.
* No loss of data and no disruption to customer application availability or performance.

## Root Cause Analysis

The incident resulted from several contributing factors:

* High-traffic collections concentrated on a single shard created hotspots within the internal database cluster during periods of elevated load.
* Inefficient query designs placed additional strain on the impacted shard during standard operations.
* The rollout of an updated software version generated a temporary spike in resource demands, exacerbating load-related challenges in the cluster.

Combined, these factors overwhelmed the affected shard and delayed metrics ingestion and operational requests.

## Prevention

MongoDB has identified several lasting improvements and implemented strategic fixes to prevent recurrence:

1. Collections experiencing concentrated load will be sharded to distribute traffic more evenly across multiple nodes, alleviating pressure on single shards.
2. Inefficient queries are being optimized to improve resource utilization and reduce latency during routine operations.
3. Additional infrastructure capacity has been provisioned to better handle elevated traffic volumes, and capacity planning processes are being refined to anticipate future spikes in load.
4. Software rollout processes are being redesigned to account for predictable increases in system resource demands during deployments, ensuring smoother rollouts.
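The first two prevention steps above can be sketched with standard mongosh administration commands. The collection and field names (`metrics.samples`, `clusterId`, `timestamp`) are hypothetical examples, not MongoDB's actual internal schema; these are admin-command fragments intended to run against a sharded cluster, not a standalone script.

```javascript
// Hypothetical mongosh sketch of the remediation steps (example names only).

// 1. Shard the hot collection on a hashed key to spread writes across shards.
sh.shardCollection("metrics.samples", { clusterId: "hashed" });

// 2. Add an index so routine lookups no longer require a full collection scan.
db.samples.createIndex({ clusterId: 1, timestamp: -1 });

// 3. Verify the query plan now uses the index (IXSCAN rather than COLLSCAN).
db.samples.find({ clusterId: "abc123" })
  .sort({ timestamp: -1 })
  .explain("executionStats");
```

A hashed shard key trades range-query locality for even write distribution, which fits an ingestion-heavy monitoring workload like the one described in this postmortem.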
## Next Steps

* MongoDB Engineering teams will continue monitoring the performance of the sharding strategies and query optimizations to ensure the hotspots stay resolved.
* Updated capacity models will incorporate metrics from this event to strengthen proactive planning across the Atlas platform.
* Feedback mechanisms for detecting elevated load conditions will be expanded to provide faster anomaly detection and response.

## Conclusion

We apologize for the impact of this event on our customers, and we recognize that this outage affected our customers' operations. MongoDB's highest priorities are security, durability, availability, and performance. We are committed to learning from this event and to updating our internal processes to prevent similar scenarios in the future.