Currents incident

Runs are not being updated and timed out

Minor · Resolved

Currents experienced a minor incident on October 20, 2025, lasting 12 hours and 22 minutes. The incident has been resolved; the full update timeline is below.

Started
Oct 20, 2025, 10:00 AM UTC
Resolved
Oct 20, 2025, 10:21 PM UTC
Duration
12 hours and 22 minutes
Detected by Pingoru
Oct 20, 2025, 10:00 AM UTC

Update timeline

  1. Resolved Oct 20, 2025, 10:21 PM UTC

    Type: Incident
    Duration: 12 hours and 22 minutes
    Affected Components: Data Pipeline, API - HTTP REST API, Data Ingestion, API - Dashboard Browsing, API

    Oct 20, 10:00:00 GMT+0 - Investigating - Today at 01:00 PT, AWS experienced an outage (us-east-1). Most AWS services have been restored, but we are still facing issues recovering all of our services.

    Oct 20, 10:00:00 GMT+0 - Investigating - Our auto-scaling is currently impacted by an AWS EC2 outage. We are waiting for new instances to be provisioned so that we can resume and complete all pending tasks.

    Oct 20, 12:44:55 GMT+0 - Identified - Due to the AWS us-east-1 outage we are still experiencing issues. There is a large backlog of data update tasks we are unable to process because of the lack of EC2 resources. We are looking for alternative runtimes to restore functionality. Relevant excerpt from AWS: [04:48 AM PDT] We continue to work to fully restore new EC2 launches in US-EAST-1. We recommend EC2 Instance launches that are not targeted to a specific Availability Zone (AZ) so that EC2 has flexibility in selecting the appropriate AZ. The impairment in new EC2 launches also affects services such as RDS, ECS, and Glue. We also recommend that Auto Scaling Groups are configured to use multiple AZs so that Auto Scaling can manage EC2 instance launches automatically.

    Oct 20, 22:21:31 GMT+0 - Resolved - Our systems are back to normal. After AWS restored their services we were able to start processing the incoming data. There is still a backlog of unprocessed events accumulated during the outage.

    * Our focus is on processing newly created runs without any delay.
    * Due to the nature of our service, there is less value in real-time processing of the delayed events, because the associated runs have already expired and timed out.
    * The backlog still needs to be post-processed for analytics and performance analysis.

    This outage revealed a few performance- and resilience-related issues with our system.
We will follow up with a more detailed analysis.
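As a minimal sketch of the multi-AZ recommendation AWS makes in the excerpt above, the snippet below builds the parameters for an Auto Scaling Group that spans several Availability Zones, so instance launches can fall back to a healthy AZ during a zonal impairment. The group name, subnet IDs, sizes, and the helper function itself are hypothetical illustrations, not Currents' actual configuration; the resulting dict matches the shape expected by boto3's `autoscaling` `create_auto_scaling_group` call.

```python
def multi_az_asg_request(name, subnet_ids, min_size=2, max_size=10):
    """Build create_auto_scaling_group parameters spanning multiple AZs.

    Passing subnets from different Availability Zones lets Auto Scaling
    retry and balance EC2 instance launches across AZs automatically,
    which is the mitigation AWS recommended during the us-east-1 outage.
    """
    if len(subnet_ids) < 2:
        raise ValueError("use at least two subnets (in different AZs) for resilience")
    return {
        "AutoScalingGroupName": name,
        "MinSize": min_size,
        "MaxSize": max_size,
        # Comma-separated subnet IDs; each should sit in a distinct AZ.
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }

# Hypothetical subnet IDs for illustration only.
params = multi_az_asg_request("workers", ["subnet-aaa", "subnet-bbb", "subnet-ccc"])
# boto3.client("autoscaling").create_auto_scaling_group(**params)
```

A single-AZ group (one subnet) would have been stuck waiting on the impaired zone, which is the failure mode described in the timeline.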