Customer.io incident
Some US customers may be impacted by data delays and UI errors
Customer.io experienced a major incident on October 30, 2025, affecting Data Processing, Message Sending, and one other component. The incident lasted 4h 32m and has been resolved; the full update timeline is below.
Update timeline
- investigating Oct 30, 2025, 05:25 PM UTC
Some US customers will experience errors in the UI and delays in data processing and sending. We do not anticipate any data loss.
- investigating Oct 30, 2025, 06:05 PM UTC
We are continuing to investigate this issue.
- identified Oct 30, 2025, 06:34 PM UTC
The issue has been identified and a fix is being worked on. Some US customers will continue to experience errors in the UI and delays in data processing and sending. No data loss is anticipated.
- identified Oct 30, 2025, 07:03 PM UTC
We are continuing to work on a fix for this issue.
- identified Oct 30, 2025, 07:34 PM UTC
We are continuing to work on a fix for this issue.
- identified Oct 30, 2025, 08:03 PM UTC
We are continuing to work on a fix for this issue.
- identified Oct 30, 2025, 08:11 PM UTC
We are continuing to work on a fix for this issue.
- identified Oct 30, 2025, 08:41 PM UTC
We are continuing to work on a fix for this issue.
- identified Oct 30, 2025, 09:11 PM UTC
We are continuing to work on a fix for this issue.
- monitoring Oct 30, 2025, 09:48 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Oct 30, 2025, 09:58 PM UTC
This incident has been resolved.
- postmortem Nov 06, 2025, 10:31 PM UTC
# **Incident Summary**

On October 30th, 2025, between 17:12 UTC and 22:10 UTC, some customers on our US data center may have experienced delays in data processing and message sending, as well as user interface errors. This issue did not affect data ingestion, and no data was lost. A kernel-level issue caused filesystem corruption for a database in our cluster, which resulted in a database crash. Using internal recovery procedures, we restored service and confirmed full data integrity.

# **Root Cause**

A filesystem-related fault on one of the production databases that stores customer journey information caused the service to fail when reading certain data. Under normal conditions, a replica database would maintain continuity of service. In this instance, however, the database experienced replication lag, which prevented an immediate failover and resulted in a longer recovery period.

# **Resolution and Recovery**

Engineers repaired and validated the affected database, ensuring data integrity throughout the recovery process. Using recent copies of the data, the team restored the impacted areas, verified system health, and restarted dependent services to confirm normal operation. Following the restoration, a replica database was rebuilt.

# **Corrective and Preventative Measures**

To ensure greater resilience, the team is enhancing database monitoring to detect early signs of filesystem-related faults, updating database recovery strategies based on learnings from this incident, and increasing the frequency of recovery drills to validate new procedures. These measures are being tracked and prioritized within our internal development process.
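To illustrate the failover constraint described in the root cause section: promoting a replica that has fallen behind risks losing acknowledged writes, so automated failover is typically gated on replication lag. Below is a minimal sketch of such a gate, assuming a PostgreSQL streaming replica (the postmortem does not name the database engine); the threshold, DSN, and function names are hypothetical.

```python
# Hypothetical failover gate: refuse to promote a replica whose
# replication lag exceeds a threshold. The postmortem does not state
# which database is in use; PostgreSQL is assumed for illustration.
import psycopg2

MAX_LAG_SECONDS = 5.0  # assumed threshold, not from the postmortem


def replica_lag_seconds(dsn: str) -> float:
    """Approximate replay lag on a PostgreSQL streaming replica."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # pg_last_xact_replay_timestamp() is NULL on a primary or
            # before any replay; treat that as zero lag for the sketch.
            cur.execute(
                "SELECT COALESCE("
                "EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()),"
                " 0)"
            )
            return float(cur.fetchone()[0])


def safe_to_promote(replica_dsn: str) -> bool:
    """Return True only if the replica is close enough to the primary."""
    lag = replica_lag_seconds(replica_dsn)
    if lag > MAX_LAG_SECONDS:
        # The situation the postmortem describes: the replica is too far
        # behind, so an immediate failover must hold off.
        return False
    return True
```

The trade-off this sketch encodes is the one the incident surfaced: a lag gate protects data integrity at the cost of a longer recovery when the replica is behind at the moment the primary fails.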
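The corrective measures mention detecting early signs of filesystem-related faults. One common approach is to watch the kernel log for filesystem error markers. The sketch below assumes a Linux host with util-linux `dmesg`; the error patterns and alerting hook are illustrative assumptions, not Customer.io's actual monitoring.

```python
# Hypothetical early-warning check: scan the kernel ring buffer for
# messages that often precede the kind of filesystem corruption
# described in the postmortem. Patterns here are illustrative.
import re
import subprocess

FS_ERROR_PATTERNS = re.compile(
    r"(EXT4-fs error|XFS.*corruption|I/O error|"
    r"remounting filesystem read-only)",
    re.IGNORECASE,
)


def scan_kernel_log() -> list[str]:
    """Return kernel log lines that look like filesystem faults."""
    # --level filters to error severity and above (util-linux dmesg).
    out = subprocess.run(
        ["dmesg", "--level=err,crit,alert,emerg"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if FS_ERROR_PATTERNS.search(line)]


if __name__ == "__main__":
    for hit in scan_kernel_log():
        # In a real deployment this would page an on-call engineer
        # rather than print to stdout.
        print(f"ALERT: possible filesystem fault: {hit}")
```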