Batch incident

CEP Campaign Delivery Degradation (Email, Push & SMS)

Batch experienced a notice-level incident on December 22, 2025, affecting Dashboard, Email delivery, Push Delivery, and SMS Delivery, lasting 3h 17m. The incident has been resolved; the full update timeline is below.

Started: Dec 22, 2025, 01:43 PM UTC
Resolved: Dec 22, 2025, 05:01 PM UTC
Duration: 3h 17m
Detected by Pingoru: Dec 22, 2025, 01:43 PM UTC

Affected components

Dashboard, Email delivery, Push Delivery, SMS Delivery

Update timeline

monitoring Dec 22, 2025, 01:43 PM UTC

An incident affected CEP campaign delivery across Email, Push, and SMS between 11:00 and 14:30 CET.

* Campaigns: Messages scheduled during this window were not sent and not retried. Analytics will show zero delivery and engagement. If impacted, consider resending your campaign.
* Automations: Delays occurred between 11:40 and 12:15 CET. Affected automations were eventually sent, and the displayed analytics are accurate.

The issue has been fixed, and we are actively monitoring the platform.
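
The impact window above is given in CET, while the timestamps on this page are in UTC. As a rough illustration of how you might check which campaigns to review for resending, here is a minimal Python sketch. The campaign names and send times are hypothetical, the window is treated as inclusive at both ends (an assumption), and nothing here calls a Batch API.

```python
from datetime import datetime, timedelta, timezone

# CET is UTC+1 (no daylight saving time in December).
CET = timezone(timedelta(hours=1), "CET")

# Impact window as stated in the update: 11:00 to 14:30 CET on Dec 22, 2025.
WINDOW_START = datetime(2025, 12, 22, 11, 0, tzinfo=CET)
WINDOW_END = datetime(2025, 12, 22, 14, 30, tzinfo=CET)


def in_impact_window(scheduled_at: datetime) -> bool:
    """Return True if a timezone-aware send time falls inside the window.

    Both ends are treated as inclusive, which is an assumption.
    """
    if scheduled_at.tzinfo is None:
        raise ValueError("scheduled_at must be timezone-aware")
    return WINDOW_START <= scheduled_at <= WINDOW_END


# Hypothetical campaigns with scheduled send times expressed in UTC.
campaigns = {
    "winter-sale": datetime(2025, 12, 22, 10, 30, tzinfo=timezone.utc),
    "newsletter": datetime(2025, 12, 22, 14, 0, tzinfo=timezone.utc),
}

# Aware datetimes compare correctly across time zones.
to_review = [name for name, ts in campaigns.items() if in_impact_window(ts)]
print(to_review)
```

In UTC, the stated window corresponds to 10:00 to 13:30, so a campaign scheduled at 10:30 UTC falls inside it and one scheduled at 14:00 UTC does not.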
resolved Dec 22, 2025, 05:01 PM UTC

This incident has been resolved.
postmortem Dec 24, 2025, 01:51 PM UTC

## Post-mortem: Partial Platform Outage

### Summary

On December 22, 2025, between **10:17 and 13:40 UTC**, part of our CEP experienced service disruptions affecting message delivery, data ingestion, and dashboards for some customers.

The incident was caused by a **hardware-related issue on a high-grade server** hosting part of our private virtualization infrastructure. The issue has since been fully resolved, and corrective actions have been implemented.

### Impact

During the incident window, the following impacts were observed:

* **Push and Email Campaigns (CEP):** Delivery was interrupted for a subset of campaigns.
* **Data Ingestion & Analytics:** Temporary ingestion delays occurred.
* **Dashboards:** The Batch dashboard was intermittently unavailable.

No impact was observed on SMS delivery. The incident affected a limited subset of customers, depending on their usage at the time.

### Timeline (UTC)

* **10:17** – First service degradation detected.
* **10:20** – Investigation initiated.
* **11:00** – Infrastructure issue identified on one hypervisor.
* **11:30** – Mitigation and recovery actions started.
* **13:40** – All impacted services fully operational.

### Root Cause

The incident was caused by a **fault in the cooling system** of a high-grade server used in our private virtualization infrastructure.

This cooling issue led to **overheating**, triggering a **protective shutdown of a disk group** on the affected server. As a result, the hypervisor abruptly lost access to multiple disks, causing all virtualized services hosted on that node to stop simultaneously.

Although this class of hardware is designed to provide strong reliability guarantees, this cooling failure resulted in the loss of a single hypervisor and exposed the impact of service colocation on shared infrastructure.

### Resolution

* The cooling issue and impacted hardware were fully repaired and validated by our infrastructure provider.
* Affected services were restarted and resynchronized.
* Data consistency was verified after recovery.

### Corrective and Preventive Actions

Following this incident, we took the following actions:

* The faulty hardware was repaired and removed from service until fully validated.
* We reviewed service placement on our private virtualization infrastructure and **reduced the colocation of critical components** on single hypervisors.
* Stateful services (including Redis clusters) were redistributed to **limit the blast radius** of a single host failure.
* We **strengthened monitoring and alerting around service colocation**, allowing us to detect and act earlier when multiple critical components are unintentionally placed on the same underlying host.

These actions aim to reduce the impact of similar infrastructure-level incidents in the future.

### Conclusion

We apologize for the disruption this incident caused. While hardware failures of this nature are rare, this event highlighted areas where we could further improve infrastructure resilience and observability.

We remain committed to transparency and continuous improvement.

**The Batch Engineering Team**
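
To illustrate the kind of colocation check described under Corrective and Preventive Actions, here is a minimal Python sketch. It is not Batch's actual monitoring: the service names, host names, and the one-critical-service-per-host threshold are all hypothetical, and the placement mapping is assumed to come from whatever inventory the virtualization layer exposes.

```python
from collections import defaultdict


def find_colocation_violations(placement, critical, max_per_host=1):
    """Return hosts that run more than max_per_host critical services.

    placement: dict mapping service name -> host name (hypothetical inventory).
    critical: set of service names considered critical.
    """
    by_host = defaultdict(list)
    for service, host in placement.items():
        if service in critical:
            by_host[host].append(service)
    # Keep only hosts whose critical-service count exceeds the threshold.
    return {
        host: sorted(services)
        for host, services in by_host.items()
        if len(services) > max_per_host
    }


# Hypothetical placement: three critical services share hypervisor hv-01.
placement = {
    "redis-a": "hv-01",
    "redis-b": "hv-01",
    "email-sender": "hv-01",
    "push-sender": "hv-02",
    "dashboard": "hv-03",
}
critical = {"redis-a", "redis-b", "email-sender", "push-sender"}

violations = find_colocation_violations(placement, critical)
print(violations)
```

With this sample data, only `hv-01` is flagged, because it hosts three critical services. A real alert would likely run such a check on a schedule and page when the result is non-empty.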