Upstash incident

Performance degradation on QStash

Upstash experienced a major incident on November 13, 2024 affecting EU-CENTRAL-1, lasting 3h 12m. The incident has been resolved; the full update timeline is below.

Started: Nov 13, 2024, 02:14 PM UTC
Resolved: Nov 13, 2024, 05:27 PM UTC
Duration: 3h 12m
Detected by Pingoru: Nov 13, 2024, 02:14 PM UTC

Affected components

EU-CENTRAL-1

Update timeline

investigating Nov 13, 2024, 02:14 PM UTC

We are currently investigating this issue.

identified Nov 13, 2024, 03:38 PM UTC

The issue has been identified and a fix is being implemented.

resolved Nov 13, 2024, 05:27 PM UTC

We will be sharing a postmortem about the incident soon.

postmortem Nov 13, 2024, 08:38 PM UTC

**Product:** QStash

**Impact:** Degraded performance, delayed processing of events, and duplicate event deliveries for some customers

## Incident Summary

QStash experienced an incident marked by a sudden and extreme load on our servers. This degraded performance and caused extremely high event-processing latency for all users. Some events were also delivered multiple times to some users. To mitigate the high load, we increased capacity as our initial response while the investigation proceeded. The fixes were then confirmed with an issue reproducer and deployed to production.

## Root Cause Analysis

In a certain type of usage, failure handling of [failureFunction](https://upstash.com/docs/workflow/basics/serve#failurefunction) can cause recursive calls, which leak tasks in the queue and put severe load on the QStash servers. This also triggered an edge case that caused some events to be delivered multiple times.

## Resolution

Two hotfixes to the QStash processes were deployed:

1. Prevent recursive calls within the failure function.
2. Eliminate duplicate deliveries while keeping the "at least once delivery" guarantee.

These were verified to resolve the root cause, normalizing server load and restoring standard event processing.

## Impact on Customers

All users experienced high event-processing latency. Some users received duplicate event deliveries. No events were lost, and all were delivered as part of our "at least once delivery" guarantee. Customers do not need to take any corrective action, as workflows have returned to normal and preventive fixes are deployed.
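
The recursive failure handling described in the root cause analysis can be illustrated with a toy model. The sketch below is not QStash's implementation: the `Task` class, the `is_failure_callback` flag, and the `run` function are all hypothetical names invented for illustration. It assumes every task fails, so each failure schedules a failure callback. Without a guard, a failing failure callback schedules another one and the queue never drains; with a guard in the spirit of the first hotfix, the chain stops after one callback.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    # Hypothetical marker for illustration only, not a QStash field.
    is_failure_callback: bool = False


def run(queue, guard, max_steps=50):
    """Process tasks until the queue drains or max_steps is reached.

    Every task fails in this toy model, so each failure schedules a
    failure callback. Returns the number of tasks processed.
    """
    processed = 0
    while queue and processed < max_steps:
        task = queue.popleft()
        processed += 1
        # With the guard on, a failed failure callback is dropped
        # instead of scheduling yet another failure callback.
        if guard and task.is_failure_callback:
            continue
        queue.append(Task(f"on-failure({task.name})", is_failure_callback=True))
    return processed


unguarded = run(deque([Task("job")]), guard=False)  # stops only at max_steps
guarded = run(deque([Task("job")]), guard=True)     # job plus one callback
print(unguarded, guarded)  # prints "50 2"
```

In the unguarded run the only thing that ends the loop is the `max_steps` cap, which stands in for the unbounded load a real system would see. The guarded run processes the original job and a single failure callback, then the queue is empty.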