Currents incident

Data processing delayed

Minor · Resolved

Currents experienced a minor incident on October 29, 2024, lasting 4 hours and 48 minutes. The incident has been resolved; the full update timeline is below.

Started
Oct 29, 2024, 01:00 PM UTC
Resolved
Oct 29, 2024, 05:47 PM UTC
Duration
4 hours and 48 minutes
Detected by Pingoru
Oct 29, 2024, 01:00 PM UTC

Update timeline

  1. resolved Oct 29, 2024, 01:00 PM UTC

    Type: Incident · Duration: 4 hours and 48 minutes

    Oct 29, 13:00:00 GMT+0 - Monitoring - The service has recovered, and we are monitoring for additional errors. We experienced a scaling issue with our data processing service, which delayed the handling of run results and run starts. Some of these delays may have caused runs not to be recorded.

    Oct 29, 17:47:55 GMT+0 - Resolved

    ## This incident has been resolved

    All systems are operational.

    ## Impact assessment

    * Customers' run results were significantly delayed for about 5 hours.
    * The delays caused further issues in services that depend on those results, such as webhooks and integrations.

    ## Root cause analysis

    At about 8:20pm PDT (3:20am UTC), an internal cleanup task ran and removed old Docker images used to deploy our data processing service. An oversight in how the task was configured relative to our deployment processes meant that it also removed the Docker image of the currently deployed service.

    Between about 4:00am PDT (11:00am UTC) and 9:00am PDT (4:00pm UTC), the system was unable to complete any scaling action, since it kept trying to deploy the deleted image. A small number of instances were still running and processing tasks, but not enough to handle the load. Once we redeployed a newer image, we quickly returned to normal operation.
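The root cause above is a cleanup task that deleted an image the running service still depended on, which then blocked every later scaling attempt. A common safeguard is to exclude any image referenced by a live deployment from cleanup rules. The Python sketch below illustrates that idea under stated assumptions; it is not Currents' actual cleanup task. The `Image` record, the `naive_cleanup` and `safe_cleanup` functions, the `keep_latest` parameter, and the tags and ages in the example are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Image:
    """Hypothetical registry entry: an image tag and when it was pushed."""

    tag: str
    pushed_at: datetime


def naive_cleanup(images, now, max_age):
    """Return every image older than max_age, even one a service is running."""
    return [img for img in images if now - img.pushed_at > max_age]


def safe_cleanup(images, now, max_age, in_use_tags, keep_latest=2):
    """Return images older than max_age, skipping any tag in in_use_tags
    and always retaining the keep_latest most recently pushed images."""
    newest_first = sorted(images, key=lambda img: img.pushed_at, reverse=True)
    retained = {img.tag for img in newest_first[:keep_latest]}
    return [
        img
        for img in images
        if now - img.pushed_at > max_age
        and img.tag not in in_use_tags  # never delete an image that is deployed
        and img.tag not in retained  # keep recent images available to redeploy
    ]


if __name__ == "__main__":
    # Hypothetical data: the cleanup runs at 03:20 UTC, as in the report.
    now = datetime(2024, 10, 29, 3, 20, tzinfo=timezone.utc)
    images = [
        Image("v1", now - timedelta(days=90)),
        Image("v2", now - timedelta(days=60)),  # the currently deployed image
        Image("v3", now - timedelta(days=1)),  # a newer image
    ]
    max_age = timedelta(days=30)
    print([img.tag for img in naive_cleanup(images, now, max_age)])  # ['v1', 'v2']
    print([img.tag for img in safe_cleanup(images, now, max_age, {"v2"}, keep_latest=1)])  # ['v1']
```

In this example the age-only rule would delete `v2`, the image still in use, while the guarded rule deletes only `v1`. Retaining the most recent images as well keeps a deployable image on hand, similar to the newer image that was redeployed to recover from this incident.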