Currents incident

Slowness in data reporting and ingestion

Minor · Resolved

Started: Apr 14, 2026, 09:30 AM UTC
Resolved: Apr 14, 2026, 03:32 PM UTC
Duration: 6 hours and 3 minutes
Detected by Pingoru: Apr 14, 2026, 09:30 AM UTC

Update timeline

  1. resolved Apr 14, 2026, 09:30 AM UTC

Type: Incident
Duration: 6 hours and 3 minutes
Affected Components: Data Pipeline, Data Ingestion

Apr 14, 09:30:00 GMT+0 - Investigating - We are currently investigating this incident.

Apr 14, 15:07:12 GMT+0 - Identified - The issue was caused by a stalled BullMQ queue and slowness in the Redis cluster. We have restored operational capacity and are investigating the root cause.

Apr 14, 15:16:09 GMT+0 - Monitoring - We are monitoring ingestion pipeline performance and stability before declaring the incident resolved.

Apr 14, 15:32:39 GMT+0 - Resolved - This incident has been resolved. A detailed post-mortem will follow.

Apr 15, 03:47:55 GMT+0 - Postmortem -

## Incident Summary

On Apr 14 we had a production degradation caused by the shared ops queue becoming saturated. This was not driven by an unusual traffic spike alone. The deeper issue was that a recent change introduced high-volume step-upload work into the hot path of the shared ops queue, so traffic patterns that had been tolerable before now created queue backlog, worker saturation, and customer-visible latency.

## Timeline

* 09:32 UTC: jobs started accumulating in the queue and pending duration increased sharply.
* 09:32 UTC onward: writer workers saturated, and latency-sensitive ops work started competing with a large volume of step-upload jobs.
* 11:30 UTC: main Redis cluster CPU reached 100%.
* An EU team member declared an incident and escalated it to the NA team. The response was significantly delayed because notifications on the on-call person's mobile device were silenced.
* 15:00 UTC: autoscaling and cleanup reduced the backlog, but in-flight update requests expire after ~2 hours, so some customers experienced data loss.

## Root Cause

* The data ingestion pipeline generates a distinct task to process step-level data: a new task is created in the operations queue for every attempt that has steps.
* Those tasks are delayed and throttled by a fixed interval, which synchronized large batches into thundering herds instead of smoothing load (see the first sketch below).
* A single task type then became the dominant queue workload and starved more latency-sensitive ops tasks, spiking worker CPU to 100% and creating cascading effects on the rest of the system.

In short: the root cause was architectural. We moved a high-volume task into the shared hot path, and that queue was no longer reserved for the more time-sensitive work it previously handled.

## Contributing Factors

* During the last two weeks we significantly refactored our architecture and moved some components to long-running ECS jobs in order to reduce the number of connections.
* We moved the step-processing task into the ops queue.
* Writer capacity and autoscaling were not aggressive enough for this new workload shape.
* We lacked early alerts on worker saturation.

## Alerts

We already have alerts for the following infrastructure components:

* Alert on ops queue waiting depth (see the monitoring sketch below).
* Alert on writer worker CPU saturation and sustained concurrency saturation.
* Alert on Redis CPU saturation and latency.

## Escalation and Communication

The most impactful issue was inadequate escalation. Despite documented escalation procedures and policies, notifications on the on-call person's personal mobile device were silenced. After the issue was resolved, Currents purchased a dedicated on-call SaaS with a mobile application that bypasses silenced notification settings. We tested it and trained all team members on how to use it.

## Technical Follow-ups

* Remove the hot-path step-processing task from the shared ops queue (see the queue-separation sketch below).
* Add the alerts above with explicit thresholds and owners.
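
The fixed-interval throttling described in the root cause can be illustrated with a minimal BullMQ sketch. The queue name, job name, payload shape, and delay values below are assumptions for illustration, not Currents' actual identifiers or settings; the point is only that giving every job in a large batch the same delay makes the whole batch runnable at the same instant, while per-job jitter spreads the load over a window.

```ts
import { Queue } from 'bullmq';

// Assumed connection and names; purely illustrative.
const connection = { host: 'localhost', port: 6379 };
const opsQueue = new Queue('ops', { connection });

const FIXED_DELAY_MS = 60_000;

// Problematic pattern: every job in a batch gets the same fixed delay,
// so the whole batch becomes runnable at the same instant (thundering herd).
async function enqueueStepJobsFixed(attempts: { id: string }[]) {
  await opsQueue.addBulk(
    attempts.map((a) => ({
      name: 'process-steps',
      data: { attemptId: a.id },
      opts: { delay: FIXED_DELAY_MS },
    })),
  );
}

// Smoother pattern: per-job jitter spreads the same batch over a window
// instead of releasing it onto the workers all at once.
async function enqueueStepJobsJittered(attempts: { id: string }[]) {
  await opsQueue.addBulk(
    attempts.map((a) => ({
      name: 'process-steps',
      data: { attemptId: a.id },
      opts: { delay: FIXED_DELAY_MS + Math.floor(Math.random() * 30_000) },
    })),
  );
}
```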
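
The first technical follow-up (taking step processing out of the shared ops queue) could look roughly like the sketch below. The "step-processing" queue name, the handler body, and the concurrency value are assumptions; the idea is that the high-volume work gets its own queue and worker pool, sized and scaled independently of the latency-sensitive ops workers.

```ts
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Dedicated queue for high-volume step-level processing (assumed name).
const stepQueue = new Queue('step-processing', { connection });

// Dedicated worker pool; the concurrency value is a placeholder and would
// be tuned against writer capacity and autoscaling limits.
const stepWorker = new Worker(
  'step-processing',
  async (job) => {
    // Placeholder: process step-level data for a single attempt.
    console.log('processing steps for attempt', job.data.attemptId);
  },
  { connection, concurrency: 25 },
);
```

Producers would then add step-upload jobs to stepQueue instead of the ops queue, leaving the shared hot path to the time-sensitive work it previously handled.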
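
The "ops queue waiting depth" alert could be fed by a simple poller along these lines; the threshold, polling interval, and console-based alert sink are placeholders, since the postmortem does not specify how the alert is implemented.

```ts
import { Queue } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const opsQueue = new Queue('ops', { connection });

// Placeholder threshold; the real value belongs in the alert definition.
const WAITING_DEPTH_THRESHOLD = 1_000;

// Check the number of waiting jobs and raise an alert when it exceeds the threshold.
async function checkOpsQueueDepth(): Promise<void> {
  const waiting = await opsQueue.getWaitingCount();
  if (waiting > WAITING_DEPTH_THRESHOLD) {
    // Replace with a real pager or alerting integration.
    console.error(`ops queue backlog: ${waiting} jobs waiting`);
  }
}

setInterval(() => void checkOpsQueueDepth(), 60_000);
```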

Looking to track Currents downtime and outages?

Pingoru polls Currents's status page every 5 minutes and alerts you the moment it reports an issue — before your customers do.

  • Real-time alerts when Currents reports an incident
  • Email, Slack, Discord, Microsoft Teams, and webhook notifications
  • Track Currents alongside 5,000+ providers in one dashboard
  • Component-level filtering
  • Notification groups + maintenance calendar
Start monitoring Currents for free

5 free monitors · No credit card required