Uploadcare incident

Service degradation

Major Resolved

Uploadcare experienced a major incident on October 2, 2023 affecting Upload API and Video processing. The incident has been resolved; the full update timeline is below.

Started: Oct 02, 2023, 05:05 PM UTC
Resolved: Oct 02, 2023, 05:05 PM UTC
Duration: not reported
Detected by Pingoru: Oct 02, 2023, 05:05 PM UTC

Affected components

Upload API, Video processing

Update timeline

  1. resolved Oct 02, 2023, 05:05 PM UTC

    From 14:15 to 16:45 UTC, we experienced higher latencies for from_url uploads and video processing. We’ve identified the source of the problem, eliminated it, and are monitoring the situation. These services are fully functional now.

  2. postmortem Oct 04, 2023, 06:38 AM UTC

    ## Upload API and Video processing services degradation (incident #5r4zj8shr69c)

    **Date**: 2023-10-02

    **Authors**: Alyosha Gusev, Denis Bondar

    **Status**: Complete

    **Summary**: From 14:15 to 16:45 UTC, Upload API and video processing experienced higher latencies due to very high demand for these services.

    **Root Causes**: Cascading failure caused by an exceptionally high number of requests to Upload API.

    **Trigger**: Latent bug triggered by a sudden traffic spike.

    **Resolution**: Changed our throttling policies and increased resources for processing.

    **Detection**: Our Customer Success team detected the issue and escalated it to the Engineering team.

    **Action Items**:

    | Action Item | Type | Status |
    | --- | --- | --- |
    | Test corresponding alerts for correctness | mitigate | **DONE** |
    | Improve our upload processing system to remove the bottleneck that we found | prevent | **DONE** |
    | Fix service access issue for team members that form potential response teams | mitigate | **DONE** |

    ## Lessons Learned

    **What went well**

    * Due to the distributed nature of Uploadcare, this incident had no effect on most of our services. The degradation didn’t affect storage, processing, or serving of files that were already stored by Uploadcare CDN.
    * Our incident mitigation strategy was right and worked immediately.

    **What went wrong**

    * This incident was detected manually rather than automatically because of an alert misconfiguration.
    * Due to hardened security standards in our organisation, not all incident responders had access to Statuspage to update our customers in a timely manner.

    ## Timeline

    2023-10-02 _(all times UTC)_

    * 14:15 Our upload processing queue starts filling
    * 14:20 **SERVICE DEGRADATION BEGINS**
    * 15:23 Our customer success team escalates the issue to the Infrastructure team
    * 15:31 Issue localised
    * 15:41 Incident response team is formed
    * 15:51:13 Adjusted our throttling policies
    * 15:51:38 Increased number of processing instances
    * 16:40 **SERVICE DEGRADATION ENDS** Processing queues clear
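
The postmortem notes that the filling queue was caught by people rather than by an alert, and lists testing the alerts as an action item. As an illustration only, here is a minimal sketch of a sustained queue-backlog check replayed against synthetic data. The metric, the names (`QueueSample`, `first_alert_minute`), the threshold, and the sample values are all hypothetical; none of them come from Uploadcare's systems.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class QueueSample:
    minute: int  # minutes since the start of the observation window
    depth: int   # pending upload-processing jobs (hypothetical metric)


def first_alert_minute(
    samples: Iterable[QueueSample],
    depth_threshold: int,
    sustained_samples: int,
) -> Optional[int]:
    """Return the minute at which the alert would fire, or None.

    The alert fires once `sustained_samples` consecutive samples exceed
    `depth_threshold`, which filters out short, harmless bursts.
    """
    streak = 0
    for sample in samples:
        streak = streak + 1 if sample.depth > depth_threshold else 0
        if streak >= sustained_samples:
            return sample.minute
    return None


# Synthetic data only: a queue that starts filling at minute 0 and keeps growing.
synthetic = [QueueSample(minute=m, depth=200 * m) for m in range(0, 30, 5)]
alert_minute = first_alert_minute(synthetic, depth_threshold=500, sustained_samples=2)
print(alert_minute)
```

Replaying a known-bad series like this through the alert condition is one way to confirm that a rule would actually fire before it is needed in production.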