Vero incident

Delays in API and message processing

Major · Resolved

Vero experienced a major incident on July 30, 2025 affecting six components across Vero 1.0 and Vero 2.0, lasting 11h 46m. The incident has been resolved; the full update timeline is below.

Started
Jul 30, 2025, 02:57 PM UTC
Resolved
Jul 31, 2025, 02:43 AM UTC
Duration
11h 46m
Detected by Pingoru
Jul 30, 2025, 02:57 PM UTC

Affected components

Vero 1.0: Ingestion API
Vero 1.0: Transactional emails
Vero 1.0: Workflows
Vero 1.0: Behavioral emails
Vero 2.0: Imports
Vero 2.0: Newsletter processing

Update timeline

  1. identified Jul 30, 2025, 02:57 PM UTC

    We are currently experiencing delays with our API and message processing systems. Our team has implemented a temporary fix and performance is improving, but you may still encounter slower response times or failed requests to Vero's Track API (a client-side retry sketch for such failures appears after this timeline). Message delivery across all channels, including workflows and newsletters, is currently delayed. We're working to fully resolve these issues and will update you as soon as normal service is restored.

  2. monitoring Jul 30, 2025, 03:32 PM UTC

    The team is continuing to monitor system stability and performance.

  3. monitoring Jul 30, 2025, 06:25 PM UTC

    The team is continuing to monitor system stability and performance.

  4. resolved Jul 31, 2025, 02:43 AM UTC

    Note: we are trying a new format for our post-mortems.

    *What was the impact*

    API processing, message sending (including transactional messages and workflows) and newsletters were intermittently offline for a brief period (15:00-15:30 UTC) and then delayed for several hours between 15:00 and 23:30 UTC as we caught up on processing.

    *What caused the impact*

    A datastore used by our queueing system ran out of memory.

    *Why'd it happen*

    After QA testing, we deployed a configuration change to our queueing system yesterday to improve the way API and workflow processing jobs are balanced between customers. This change led to unexpected memory growth in our queueing system, causing it to fail. Whilst we were alerted, the rate of memory growth was unprecedentedly fast, leaving little time to react.

    *What changes have we made*

    - Investigated the cause of the memory growth and patched it.
    - Adjusted our alerting on memory growth so we are alerted earlier, giving us more time to fix this issue in the unlikely event it occurs again (a sketch of this kind of rate-based alerting appears below).

    *Any other information*

    This particular part of our infrastructure has been frustratingly brittle over the last year or so, due to inefficient data storage. As a result, we elected to migrate to a datastore better suited to these workloads (DynamoDB) at the start of 2025. That work is nearing completion and will improve queue processing throughput by at least 10x, giving us much-needed headroom.

    Questions: please email us at [email protected].
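
For illustration, a minimal sketch of the kind of rate-based memory alerting described in the post-mortem. It assumes the queueing datastore is Redis-like and reports usage via INFO; the post-mortem does not name the datastore, and the memory ceiling, polling interval, look-ahead window, and alert() hook here are all hypothetical rather than Vero's actual configuration.

```python
import time

import redis  # pip install redis

LIMIT_BYTES = 8 * 1024**3  # hypothetical memory ceiling for the datastore
LOOKAHEAD_S = 15 * 60      # alert if exhaustion is projected within 15 minutes
POLL_S = 30                # sampling interval in seconds


def seconds_until_exhaustion(samples):
    """Linear projection: seconds until memory reaches LIMIT_BYTES, or None."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    rate = (m1 - m0) / (t1 - t0)  # bytes per second over the window
    if rate <= 0:
        return None  # flat or shrinking usage: nothing to project
    return (LIMIT_BYTES - m1) / rate


def alert(message):
    print(f"ALERT: {message}")  # stand-in for a real paging integration


def watch(client):
    samples = []
    while True:
        used = client.info(section="memory")["used_memory"]
        samples.append((time.time(), used))
        samples = samples[-10:]  # keep a short sliding window of readings
        if len(samples) >= 2:
            eta = seconds_until_exhaustion(samples)
            # A growth-rate check fires well before an absolute-threshold
            # check would: this is the "alerted earlier" change above.
            if eta is not None and eta < LOOKAHEAD_S:
                alert(f"queue datastore projected to exhaust memory in {eta:.0f}s")
        time.sleep(POLL_S)


if __name__ == "__main__":
    watch(redis.Redis())
```

Projecting exhaustion from the growth rate, rather than waiting for an absolute threshold to be crossed, is what buys the extra reaction time the post-mortem describes.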
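
For callers who saw failed requests to the Track API during the incident, a common client-side mitigation is retrying with exponential backoff and jitter. A minimal sketch follows; the endpoint URL, timeout, and retry policy are assumptions for illustration and are not taken from this report.

```python
import random
import time

import requests  # pip install requests

TRACK_URL = "https://api.getvero.com/api/v2/events/track"  # assumed endpoint


def track_event(payload, max_attempts=5):
    """POST an event, retrying transient failures with backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            resp = requests.post(TRACK_URL, json=payload, timeout=10)
            if resp.status_code < 500:
                return resp  # success, or a client error a retry won't fix
        except requests.RequestException:
            pass  # connection error or timeout: fall through and retry
        # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter so
        # many clients don't retry in lockstep against a degraded service.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Track API request still failing after retries")
```

Queuing events durably on the client and replaying them after recovery is an alternative when some delivery delay is acceptable, since it avoids adding load to a service that is already catching up.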