Customer.io incident
US infrastructure delays affecting a subset of outgoing deliveries
Customer.io experienced a minor incident on November 24, 2025, affecting Message Sending and lasting 36 minutes. The incident has been resolved; the full update timeline is below.
Affected components
- Message Sending
Update timeline
- investigating Nov 24, 2025, 08:15 AM UTC
Our team is aware of an issue delaying outgoing Campaign and Transactional deliveries. We are currently looking into this.
- identified Nov 24, 2025, 08:38 AM UTC
We have identified the issue; the backlog is now processing.
- identified Nov 24, 2025, 08:45 AM UTC
The Transactional queue backlog has finished processing; the priority queue backlog is still being worked through.
- resolved Nov 24, 2025, 08:51 AM UTC
We confirm the incident has been resolved. The system is fully operational.
- postmortem Dec 11, 2025, 04:12 PM UTC
**Duration:** 2h 14m, Nov 24 8:08 AM UTC - 10:22 AM UTC
**Severity:** P2
**Impact:** Reduced message processing rates across US infrastructure

### What Happened

On November 24th, our message rendering service experienced degraded performance during a period of high traffic. The service, which renders your messages for delivery, encountered memory constraints that caused intermittent service restarts and slower processing rates for priority queues across our US infrastructure.

### Customer Impact

* **Date:** November 24, 2025
* **Affected Services:** All message types including campaigns, transactional messages, and journey workflows
* **Geographic Scope:** US-based workspaces
* **Performance Impact:** Message processing slowed to approximately 50% of normal capacity during peak impact
* **Duration:** 2 hours 14 mins of degraded performance
* **Data Integrity:** No messages were lost; all queued messages were successfully processed

### Root Cause

Our message rendering service runs on an auto-scaling infrastructure that automatically adjusts capacity based on workload. During this incident, sudden traffic spikes caused individual servers to consume memory faster than our auto-scaling could compensate. When servers reached memory limits, they restarted automatically (as designed for resilience), but these rolling restarts reduced our overall processing capacity during a time of peak demand, creating a compound effect.

### Resolution

**Immediate fix:** We deployed updated code to our rendering service that better manages memory consumption during traffic bursts, preventing the cascade of restarts that degraded performance.

**Why this works:** The update implements more efficient memory allocation patterns and adds throttling mechanisms that prevent any single traffic burst from overwhelming individual servers, regardless of auto-scaling speed.

### What We're Doing to Prevent This

1. **Smarter Resource Management** (Completed)
   * Deployed code optimizations that prevent memory exhaustion during traffic spikes
   * Implemented per-node workload throttling to maintain stability
2. **Improved Auto-scaling** (In Progress - Q1 2026)
   * Tuning our auto-scaling to be more predictive rather than reactive
   * Increasing baseline capacity to handle larger bursts without scaling delays
3. **Better Early Warning** (In Progress)
   * Adding memory pressure alerts that trigger before critical thresholds
   * Implementing graduated responses to traffic spikes (pre-scaling based on queue depth trends)

### Our Commitment

While our platform maintained data integrity throughout this incident (no messages were lost), we understand that processing delays impact your customer engagement timing. We are actively working to ensure all customers’ messages are processed and delivered as efficiently and reliably as possible.

### Questions?

Your Customer Success Manager has details specific to your workspace's impact during this incident. For technical questions or to discuss our infrastructure roadmap, please reach out to your account team.
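For readers who want a concrete picture of the mitigations described in the postmortem, the sketch below illustrates the two ideas in the prevention plan: deferring new render work when a node nears its memory limit, and watching queue-depth growth to request capacity before hard thresholds are hit. This is a minimal illustration only; the class names, thresholds, and use of the third-party `psutil` library are assumptions made for the example, not Customer.io's actual implementation.

```python
"""Illustrative sketch: per-node memory-pressure throttling and a
queue-depth trend signal for pre-scaling. All values are hypothetical."""
from collections import deque

import psutil  # third-party: pip install psutil

# Hypothetical tuning values, not Customer.io's real configuration.
SOFT_RSS_LIMIT_BYTES = 1_500 * 1024 * 1024  # defer new renders above this
TREND_WINDOW = 12                           # queue-depth samples to keep
SCALE_UP_SLOPE = 500                        # growth per sample that triggers pre-scaling


class RenderThrottle:
    """Defers new render jobs when this node is close to its memory limit."""

    def __init__(self, soft_limit_bytes: int = SOFT_RSS_LIMIT_BYTES):
        self.soft_limit_bytes = soft_limit_bytes
        self.process = psutil.Process()

    def should_defer(self) -> bool:
        # True means the worker loop leaves the job on the queue instead of
        # letting the node exhaust memory and restart mid-burst.
        return self.process.memory_info().rss >= self.soft_limit_bytes


class QueueDepthTrend:
    """Tracks recent queue depth and flags sustained growth for pre-scaling."""

    def __init__(self, window: int = TREND_WINDOW):
        self.samples = deque(maxlen=window)

    def record(self, depth: int) -> None:
        self.samples.append(depth)

    def needs_capacity(self) -> bool:
        # Only act on a full window so a single spike does not trigger scaling.
        if len(self.samples) < self.samples.maxlen:
            return False
        growth_per_sample = (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)
        return growth_per_sample >= SCALE_UP_SLOPE


if __name__ == "__main__":
    throttle = RenderThrottle()
    trend = QueueDepthTrend()

    # Synthetic queue-depth samples simulating a sustained traffic spike.
    for depth in range(1_000, 13_000, 1_000):
        trend.record(depth)

    print("defer new renders on this node:", throttle.should_defer())
    print("request additional capacity:", trend.needs_capacity())
```

In this sketch the throttle protects each node individually, while the trend signal gives the autoscaler lead time, which together approximate the "graduated responses to traffic spikes" described above.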