Voyado experienced a minor incident on September 17, 2025 affecting API and Messaging, lasting 1h 39m. The incident has been resolved; the full update timeline is below.
Affected components
API, Messaging
Update timeline
- Investigating Sep 17, 2025, 07:41 AM UTC
We are currently experiencing delays with sendouts. A subset of customers may also be experiencing degraded API performance. We are investigating and will provide updates as soon as we have more information.
- Investigating Sep 17, 2025, 08:08 AM UTC
Our teams are fully engaged and continue working with high priority to resolve the situation. However, the issue is still ongoing and some users may continue to experience problems. We’ll share further updates as soon as we have more information.
- Identified Sep 17, 2025, 08:22 AM UTC
We have identified the issue and taken remedial action. Initial improvements have been observed, with API performance and message delays showing signs of recovery. We're currently working on an additional fix to ensure long-term stability and are closely monitoring the situation before confirming full return to normal operations.
- Identified Sep 17, 2025, 08:59 AM UTC
A hotfix has been prepared and is currently being deployed. We will share an update as soon as we have confirmed its effect.
- Resolved Sep 17, 2025, 09:21 AM UTC
We have returned to normal API operations, and new messages are being sent as expected. The messages that were queued and delayed due to the incident are now being processed. A follow-up post-mortem incident report will be shared.
- Postmortem Sep 25, 2025, 09:44 AM UTC
### Summary

On the morning of September 17th, the Engage platform experienced degraded performance primarily affecting messaging and APIs. The root cause was traced to a scenario triggered by the intentional disabling of messaging functionality for a specific tenant. This caused an unusually high load on our infrastructure, with ripple effects in parts of the system affecting other tenants as well.

### Customer Impact

* **Messaging delays** of 10–30 minutes for most customers. One tenant experienced longer delays (up to 60 minutes) and required minor manual intervention for a small number of messages.
* **APIs affected**: CreateContact, GetContact, and UpdateContact showed poor response times for a subset of customers.

No messages were lost, and all systems were fully recovered by 11:10 CEST.

### Root Cause

The incident stemmed from disabling messaging functionality for a specific tenant while messaging activity for that tenant was ongoing. This generated a large amount of activity in the system, among other things triggering a poorly optimized database query. The query caused significant load on our database servers, leading to cascading delays across queues and APIs.

### Mitigation

* Redeployment of services to free up resources
* Actions to unblock queued-up events
* A hotfix deployed to optimize an identified complex query and reduce system strain

### Next Steps

To prevent similar issues, we are working on the following improvements:

* Query handling for stopped messages will be further optimized.
* Enhanced monitoring and tooling around internal message handling and query execution are being introduced.

We appreciate your patience and understanding, and apologize for any inconvenience. Please reach out if you have any questions or need further clarification.