Voyado incident

[Engage] Messages not being sent

Voyado experienced a critical incident on March 11, 2025 affecting Messaging, lasting 3h 33m. The incident has been resolved; the full update timeline is below.

Started: Mar 11, 2025, 08:24 AM UTC
Resolved: Mar 11, 2025, 11:57 AM UTC
Duration: 3h 33m
Detected by Pingoru: Mar 11, 2025, 08:24 AM UTC

Affected components

Messaging

Update timeline

investigating Mar 11, 2025, 08:24 AM UTC

We are currently experiencing issues with sendouts and they are not being sent. We are investigating this and working on a solution.

investigating Mar 11, 2025, 08:57 AM UTC

We have identified that the issue is only affecting a subset of customers. We are continuing to investigate the issue and are preparing to deploy a fix.

identified Mar 11, 2025, 09:46 AM UTC

All the messages that got stuck have been resent. We are rolling out a fix to remediate the cause.

resolved Mar 11, 2025, 11:57 AM UTC

This incident has been resolved.

postmortem Apr 06, 2025, 09:22 PM UTC

## Summary

On the morning of March 11, 2025, an issue occurred in the Engage platform that resulted in delays for message send-outs for a subset of our customers. The incident was triggered by an unexpected event in our in-memory database setup, which temporarily disrupted the platform’s ability to process and send messages. The issue was resolved rapidly and all affected send-outs were successfully delivered, either automatically or through manual resending.

### Customer Impact

Approximately 54 customers experienced a temporary halt in their message send-outs for about one hour. Most messages were eventually sent out automatically once the issue resolved itself, but a smaller portion required manual resending by our team. No messages were lost.

### Root Cause

The issue was caused by an unexpected failover in our in-memory database, which altered the primary-secondary configuration and triggered faulty callbacks in our system. This misconfiguration prevented messages from being processed as expected, which led to the delay.

### Remediation & Mitigation

* Our team identified the issue quickly through our monitoring and began troubleshooting.
* A hotfix was implemented the same morning to remediate the faulty callbacks that prevented message execution and to mitigate future occurrences of the unexpected behavior.
* Messages stuck in the queue were either automatically processed or manually resent by our support team.

### Next Steps

We recognize that similar in-memory database-related issues have occurred in the past. Based on recent events and as part of our continuous improvement and reliability work, we are reviewing our in-memory database setup to improve its resilience and behavior during failovers.

We appreciate your patience and understanding, and we remain committed to providing a stable and reliable platform experience.