Customer.io incident

Customers may be impacted by data delays

Minor · Resolved

Customer.io experienced a minor incident on December 8, 2025, affecting Data Processing and Message Sending. The incident lasted 2h 6m and has been resolved; the full update timeline is below.

Started
Dec 08, 2025, 06:12 PM UTC
Resolved
Dec 08, 2025, 08:19 PM UTC
Duration
2h 6m
Detected by Pingoru
Dec 08, 2025, 06:12 PM UTC

Affected components

Data Processing, Message Sending

Update timeline

  1. investigating Dec 08, 2025, 06:12 PM UTC

    We identified a recent change that might cause data delays. We are investigating.

  2. identified Dec 08, 2025, 06:24 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Dec 08, 2025, 06:28 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. resolved Dec 08, 2025, 08:19 PM UTC

    This incident has been resolved.

  5. postmortem Dec 11, 2025, 03:45 PM UTC

    Incident Summary

    On December 8th, 2025, beginning at 17:38 UTC, some customers experienced delays in data processing and message delivery. Normal functionality was fully restored at 18:19 UTC, for a total duration of 41 minutes. No data was lost.

    A failure during the startup of an internal processing service prevented it from becoming fully operational, leading to reduced throughput and increased retry activity in upstream components.

    Root Cause

    During startup, one of our processing services loads information about message queues before beginning normal operation. An unexpected queue state left over from a previous configuration caused the service to encounter an error during this process, leading it to restart repeatedly without successfully completing initialization.

    Because this service is responsible for handing off work to downstream processors, its unavailability resulted in a drop in throughput and a rise in retry traffic. The elevated retries added load to our underlying data layer and contributed to further delays.

    Resolution and Recovery

    Engineers identified the failing service, corrected the underlying queue state, and restored the service to full operation. Once stabilized, normal processing resumed and retry volumes returned to expected levels. The system was monitored to confirm full recovery.

    Corrective and Preventative Measures

    To prevent recurrence, the team is improving validation during service startup to better handle unexpected queue conditions, refining deployment procedures to detect stalled services sooner, and enhancing monitoring for repeated restart patterns. These improvements are being incorporated into ongoing reliability work.

    We apologize for any disruption this caused.
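
The corrective measures above center on validating queue state at startup so a stale record degrades gracefully instead of triggering a restart loop. Below is a minimal sketch of that general pattern in Go. The names (QueueState, validateQueue, initQueues) and the generation-based staleness check are illustrative assumptions, not Customer.io's actual implementation.

```go
// Hypothetical sketch of defensive queue-state validation during service
// startup. All identifiers here are invented for illustration.
package main

import (
	"errors"
	"fmt"
	"log"
)

// QueueState is a hypothetical snapshot of a queue's persisted configuration.
type QueueState struct {
	Name       string
	Generation int  // configuration generation that created the queue
	Orphaned   bool // true if the queue no longer maps to current config
}

var errStaleQueue = errors.New("queue belongs to a previous configuration")

// validateQueue rejects queue records left over from an earlier configuration
// up front, instead of letting them fail deep inside initialization.
func validateQueue(q QueueState, currentGen int) error {
	if q.Orphaned || q.Generation < currentGen {
		return fmt.Errorf("%s: %w", q.Name, errStaleQueue)
	}
	return nil
}

// initQueues loads queue state and skips (and reports) entries that fail
// validation, so a single stale queue cannot block the whole startup sequence.
func initQueues(states []QueueState, currentGen int) ([]QueueState, error) {
	healthy := make([]QueueState, 0, len(states))
	for _, q := range states {
		if err := validateQueue(q, currentGen); err != nil {
			// Quarantine rather than crash: log the problem and continue.
			log.Printf("skipping queue during startup: %v", err)
			continue
		}
		healthy = append(healthy, q)
	}
	if len(healthy) == 0 {
		return nil, errors.New("no usable queues found; refusing to start")
	}
	return healthy, nil
}

func main() {
	// Simulated persisted state: one current queue, one stale leftover.
	states := []QueueState{
		{Name: "deliveries", Generation: 7},
		{Name: "legacy-batches", Generation: 3, Orphaned: true},
	}
	queues, err := initQueues(states, 7)
	if err != nil {
		log.Fatalf("startup failed: %v", err)
	}
	fmt.Printf("started with %d queue(s)\n", len(queues))
}
```

In this sketch the stale "legacy-batches" entry is logged and skipped while startup completes with the remaining queue, which mirrors the stated goal of handling unexpected queue conditions without repeated restarts; pairing it with alerting on the skip log would also support the monitoring improvement the postmortem mentions.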