Uploadcare incident

Upload service degradation.

Uploadcare experienced a major incident on December 12, 2018 affecting Upload API, lasting 9h 29m. The incident has been resolved; the full update timeline is below.

Started: Dec 12, 2018, 11:34 PM UTC
Resolved: Dec 13, 2018, 09:03 AM UTC
Duration: 9h 29m
Detected by Pingoru: Dec 12, 2018, 11:34 PM UTC

Affected components

Upload API

Update timeline

investigating Dec 12, 2018, 11:34 PM UTC

We're experiencing issues with our Upload API.
investigating Dec 12, 2018, 11:44 PM UTC

We are continuing to investigate this issue.
identified Dec 13, 2018, 01:27 AM UTC

The issue has been identified and a fix is being implemented.
monitoring Dec 13, 2018, 01:54 AM UTC

A fix has been implemented and we are monitoring the results.
resolved Dec 13, 2018, 09:03 AM UTC

This incident has been resolved.
postmortem Dec 17, 2018, 04:43 PM UTC

On December 12, we had a degradation of our Upload API. Most users were unable to upload files for 3 hours 11 minutes, between Dec 12, 22:39 GMT and Dec 13, 01:50 GMT.

## What happened

Requests to the Upload API were either handled extremely slowly or (most of them) rejected by our web workers. Scaling up our Upload API fleet didn't help.

## What really happened

Further investigation revealed that:

- Slow requests were consuming all available web workers, and with no workers available, requests were rejected by nginx.
- Handled requests were slow due to constant database locks on one database table.
- The database locks were caused by a dramatic change in tracked usage statistics (a change of project settings by one of our largest customers).

We spent most of the incident investigating and figuring out what was happening. Actual DB load was average, so the DB was wrongly dismissed as the source of the issues at first. Once we found the root cause, the fix was trivial and took minutes to implement and deploy.

## What we have done

We turned off usage tracking for the particular customer.

## What we will do

- Refactor statistics tracking so it does not affect our core service.
- Add more specific monitors to our DB so we can identify problems of a similar nature much faster.
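One way to read the statistics-tracking follow-up is to keep usage counting off the request path entirely. The sketch below is an illustration under assumptions, not Uploadcare's implementation: `BufferedUsageTracker`, `record`, `flush`, and the `flush_fn` sink are hypothetical names. Counts are accumulated in memory during request handling and written out in one batch by a background job, so a burst of updates for a single project never holds a database lock while a web worker is serving an upload.

```python
import threading
from collections import Counter


class BufferedUsageTracker:
    """Hypothetical helper: accumulate usage counts in memory, flush in batches."""

    def __init__(self, flush_fn):
        # flush_fn is an assumed sink, e.g. a function doing one batched DB write.
        self._flush_fn = flush_fn
        self._lock = threading.Lock()
        self._pending = Counter()

    def record(self, project_id, amount=1):
        # Called on the request path: no database access, only a short in-process lock.
        with self._lock:
            self._pending[project_id] += amount

    def flush(self):
        # Called from a background job or timer, outside the request path.
        with self._lock:
            batch, self._pending = self._pending, Counter()
        if batch:
            try:
                self._flush_fn(dict(batch))
            except Exception:
                # Put the counts back so they are retried on the next flush.
                with self._lock:
                    self._pending.update(batch)
                raise
        # Number of distinct projects written in this batch.
        return len(batch)
```

With this shape, a failing or slow statistics store delays only the background flush; upload requests keep calling `record`, which never touches the database. The trade-off is that counts still in memory are lost if the process exits before the next flush.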