Voyado incident
[Engage] - Disturbance identified, affecting Messages and Automations
Voyado experienced a minor incident on March 2, 2025, affecting Messages and Automations, lasting 2h 47m. The incident has been resolved; the full update timeline is below.
Affected components
- Engage (Messages and Automations)
Update timeline
- investigating Mar 02, 2025, 08:36 AM UTC
We have identified a disturbance in Voyado affecting Message sendouts and Automations; neither is currently being processed. We are investigating the issue.
- investigating Mar 02, 2025, 08:55 AM UTC
We are continuing to investigate this issue.
- monitoring Mar 02, 2025, 09:00 AM UTC
A fix has been implemented with promising results. We are seeing that Messages are being sent again and Automations are being processed. We will continue to monitor the situation.
- monitoring Mar 02, 2025, 09:25 AM UTC
The processing of Messages and Automations is still looking good after the implemented fix. Queued-up messages are still being worked through with some delay, but Automations are back to normal.
- monitoring Mar 02, 2025, 09:58 AM UTC
As previously stated, operations are back to normal. We are still working on the aftermath, i.e. resending email messages that were not sent due to the outage. Automations are fully synced and running as normal. The next update will come once all messages have been resent.
- resolved Mar 02, 2025, 11:24 AM UTC
All messages have now been resent. The incident is now resolved.
- postmortem Mar 12, 2025, 08:26 AM UTC
## Summary

Between 09:12 and 09:52 on March 2nd we encountered an issue affecting a central in-memory database used by many processes in the platform. The issue left the service in a state that did not trigger failover to backup services, causing various anomalies throughout the platform. Among those were a large number of delayed messages (requiring a manual resend), automation events not triggering, reported login issues, and more.

## Customer Impact

The issue mainly affected customers whose message sendouts and activity executions fell within the 09:12 - 09:52 time frame; these were delayed.

## Root Cause and Mitigation

**Root Cause**

The root cause of the issue was a central in-memory database ending up in a bad state. The database is used for storing data for quick access throughout the platform in a high-load, low-latency configuration. Because this data is needed in various processes, the effect spread over multiple parts of the platform, but only specific use cases were greatly affected from a user perspective.

The database has a redundant setup, with a primary-to-multiple-replicas configuration, where failover to a replica is automatic should the primary service run into issues. In this instance all servers in the setup ended up in a replica state, with no primary active, thus causing the issue.

**Mitigation**

Enforce primary: to remediate the issue we enforced a new primary in the configuration, which returned Engage to a normal state where messages and execution of activities were functioning as expected.

## Next steps

Unfortunately, this has happened before, and although we took action to prevent it from happening again, it recurred. Even after investigation, the root cause is still not clear. Our intention now is to review our current in-memory database setup and take action to upgrade and update it.
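The postmortem does not name the database, but the failure mode (every node in replica state, no active primary) and the fix (enforcing a new primary) map onto any primary-replica in-memory store. Below is a minimal sketch of that detect-and-promote step, assuming a Redis-style replica set managed with the redis-py client; the node addresses and the promotion logic are illustrative, not Voyado's actual tooling.

```python
# Sketch: detect a "no primary" state across a set of Redis nodes and
# force one replica to become primary, mirroring the "enforce primary"
# mitigation. Node addresses are hypothetical.
import redis

NODES = [("10.0.0.1", 6379), ("10.0.0.2", 6379), ("10.0.0.3", 6379)]


def roles():
    """Return the replication role reported by each reachable node."""
    out = {}
    for host, port in NODES:
        try:
            info = redis.Redis(host=host, port=port, socket_timeout=2).info("replication")
            out[(host, port)] = info["role"]  # "master" or "slave"
        except redis.RedisError:
            out[(host, port)] = "unreachable"
    return out


def enforce_primary():
    """If every reachable node is a replica, promote the first one found."""
    current = roles()
    if "master" in current.values():
        return  # a primary already exists; nothing to do
    for (host, port), role in current.items():
        if role == "slave":
            node = redis.Redis(host=host, port=port, socket_timeout=2)
            node.execute_command("REPLICAOF", "NO", "ONE")  # promote to primary
            # Repoint the remaining replicas at the new primary.
            for (h, p), r in current.items():
                if (h, p) != (host, port) and r == "slave":
                    redis.Redis(host=h, port=p, socket_timeout=2).execute_command(
                        "REPLICAOF", host, str(port)
                    )
            return


if __name__ == "__main__":
    enforce_primary()
```

In a healthy setup this promotion is the job of the failover orchestrator (for Redis, typically Sentinel); the manual path above is the break-glass equivalent of what the mitigation describes, used only when the automatic failover itself fails to fire.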