Voyado incident

We are currently experiencing a delay in sending messages.

Voyado experienced a minor incident on April 21, 2025 affecting Messaging, lasting —. The incident has been resolved; the full update timeline is below.

Started: Apr 21, 2025, 09:58 PM UTC
Resolved: Apr 21, 2025, 09:58 PM UTC
Duration: —
Detected by Pingoru: Apr 21, 2025, 09:58 PM UTC

Affected components

Messaging

Update timeline

investigating Apr 21, 2025, 09:32 AM UTC

We are currently experiencing a delay in sending messages. We are investigating this and working on a solution.

investigating Apr 21, 2025, 10:53 AM UTC

We are continuing to investigate this issue and working on a solution.

investigating Apr 21, 2025, 01:16 PM UTC

The degradation has been mitigated and we're currently working on addressing the aftermath (making sure all delayed messages are sent).

resolved Apr 21, 2025, 09:58 PM UTC

This incident has been resolved.

postmortem May 02, 2025, 08:22 AM UTC

**Summary**

On the morning of April 21st, Voyado Engage experienced an issue that delayed the delivery of email messages. This primarily impacted messages sent through automation workflows. While no messages were lost, many were delivered later than intended. The situation was fully resolved the same day, and we are taking steps to ensure it does not recur.

**Customer Impact**

Approximately fifty percent of our customer base was affected by the incident. Most of the delays impacted automated email workflows, though some manual send-outs were also affected. While all messages were eventually delivered, delays ranged from about 30 minutes to 3 hours for some customers.

**Root Cause**

The incident was mainly caused by inefficient memory management in the mail-processing application code. Over time, the servers' memory usage steadily increased, peaking on April 21st. Combined with a few exceptionally large email campaigns, this put the system under severe resource pressure:

* Memory Leaks: Memory was not properly released, causing sustained high usage that led to timeouts and storage delays as well as high CPU load:
  * Timeouts and Storage Delays: Because of the high memory usage, the platform struggled to write data to storage fast enough, resulting in application slowdowns.
  * CPU Load: Some mail servers reached abnormally high CPU usage, worsening the delays.

Importantly, no failures were detected in our cloud infrastructure, and no messages were lost.

**Mitigation**

Once the incident was identified:

* A full application deploy was initiated to clear up memory usage and stabilize the system, essentially performing a reboot of the application.
* On-call engineers monitored the queues and gradually cleared all delayed messages.
* Additional manual steps were taken to resend any stuck processes, ensuring no message was left behind.

By 19:00 CEST on April 21st, all messages had been successfully sent and the system was back to a healthy operational state.

**Next Steps**

To prevent similar issues in the future, we are taking several actions to evaluate and potentially adjust memory utilization in the application, in addition to fine-tuning the monitoring of memory and storage health. We are also updating our incident management process to enable faster mitigation should similar symptoms appear.

We appreciate your patience and understanding, and apologize for any inconvenience. We remain committed to providing a stable and reliable platform experience.