Voyado incident

Voyado Engage - Emails delayed

Voyado experienced a notice-level incident on February 9, 2025 affecting Messaging, lasting 53 minutes. The incident has been resolved; the full update timeline is below.

Started: Feb 09, 2025, 09:32 AM UTC
Resolved: Feb 09, 2025, 10:26 AM UTC
Duration: 53m
Detected by Pingoru: Feb 09, 2025, 09:32 AM UTC

Affected components

Messaging

Update timeline

identified Feb 09, 2025, 09:32 AM UTC

We recently experienced an issue that has now been resolved. As a result, some messages may be delayed. We are actively working to send them out as soon as possible.

resolved Feb 09, 2025, 10:26 AM UTC

The issue has now been resolved, and we have processed the delayed messages.

postmortem Feb 25, 2025, 07:12 AM UTC

**Summary**

On the morning of February 9th, we detected system slowness through triggered warnings. Initially, it appeared to be linked to a single tenant's large-scale send-out, but further investigation revealed that multiple tenants were affected. An alert was later triggered indicating that a shared in-memory database, which helps process messages efficiently, was unavailable. This caused delays in message processing, impacting approximately 130 tenants. While most messages were eventually processed automatically, some required manual intervention. The maximum delay experienced was up to three hours, though this only affected a small number of messages for a few tenants.

**Customer Impact**

Customers experienced delays in their scheduled and automated message send-outs, including both SMS and emails. The disruption was due to a shared in-memory database becoming unavailable, which paused message processing. Once the system resumed, a backlog caused further delays. Our on-call team manually resent messages that got stuck, but a small portion of messages for a few tenants could not be resent. These customers were contacted directly. Delays ranged from as little as five minutes up to a maximum of 240 minutes.

**Root Cause**

The issue was caused by an unexpected data handover problem in a shared in-memory database, which temporarily lost track of some messages. This database is designed to handle a high volume of messages quickly and efficiently. Normally, if there’s an issue, the system switches to a backup automatically. However, in this case, when the switch happened, some data was lost. As a result, the system had trouble determining which messages had been sent and which were still in progress, leading to delays. Messages scheduled for processing after the disruption were handled as expected once the system recovered.

**Mitigation**

Since the issue was caused by an automatic switch to a backup system, the system recovered on its own. However, our team had to manually resend messages that had gotten stuck in the process.

**Next Steps**

We are currently evaluating improvements in the following areas:

* Enhancing system robustness to minimize the risk of data loss during failovers.
* Implementing automatic resending of delayed messages to quickly mitigate the effects of any disruption.

We appreciate your patience and understanding. We remain committed to providing a reliable and seamless experience on the Engage platform. If you have any further questions or concerns, please reach out to our support team.
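
The automatic resending mentioned under Next Steps could take many forms, and the postmortem does not describe Voyado's design. The following is only a minimal sketch under stated assumptions: a durable record of scheduled messages exists outside the in-memory database, the set of in-flight message IDs can be read back from the in-memory database after a failover, and a five-minute grace period separates a normal wait from a stuck message. The names `Message`, `find_stuck`, `resend_stuck`, and `grace_seconds` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Set


class Status(Enum):
    PENDING = "pending"
    SENT = "sent"


@dataclass
class Message:
    message_id: str
    scheduled_at: float  # seconds since epoch
    status: Status = Status.PENDING


def find_stuck(
    durable_log: Iterable[Message],
    in_flight_ids: Set[str],
    now: float,
    grace_seconds: float = 300.0,  # assumed threshold, not from the postmortem
) -> List[Message]:
    """Return messages that are overdue, unsent, and unknown to the in-memory store."""
    return [
        m
        for m in durable_log
        if m.status is Status.PENDING
        and m.message_id not in in_flight_ids
        and now - m.scheduled_at > grace_seconds
    ]


def resend_stuck(
    durable_log: Iterable[Message],
    in_flight_ids: Set[str],
    now: float,
    send: Callable[[Message], None],
    grace_seconds: float = 300.0,
) -> List[str]:
    """Resend every stuck message and mark it as sent in the durable record."""
    resent = []
    for message in find_stuck(durable_log, in_flight_ids, now, grace_seconds):
        send(message)  # hypothetical delivery callback
        message.status = Status.SENT
        resent.append(message.message_id)
    return resent
```

In this sketch the durable record, not the in-memory database, decides whether a message was sent, so state lost during a failover shows up as a pending message that is no longer in flight and is picked up on the next pass.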