Etleap incident

Increased latency for pipelines in Hosted environment

Major · Resolved

Etleap experienced a major incident on October 20, 2025 affecting the UI, Pipelines, and several other components, lasting 11h 28m. The incident has been resolved; the full update timeline is below.

Started
Oct 20, 2025, 10:15 AM UTC
Resolved
Oct 20, 2025, 09:43 PM UTC
Duration
11h 28m
Detected by Pingoru
Oct 20, 2025, 10:15 AM UTC

Affected components

UI, Pipelines, Pipelines API, Event Streams, dbt Schedules

Update timeline

  1. monitoring Oct 20, 2025, 10:15 AM UTC

    AWS Outage – Operational issues, multiple services (N. Virginia). Outage first reported Mon 20 October 07:11 AM UTC (12:11 AM PDT). Outage started recovering Mon 20 October 09:27 AM UTC (02:27 AM PDT).

  2. monitoring Oct 20, 2025, 11:29 AM UTC

    VPCs deployed in US East (N. Virginia) also affected

  3. monitoring Oct 20, 2025, 01:15 PM UTC

    We are continuing to monitor for any further issues.

  4. monitoring Oct 20, 2025, 01:22 PM UTC

    We are continuing to monitor for any further issues.

  5. monitoring Oct 20, 2025, 01:31 PM UTC

    We are continuing to monitor for any further issues.

  6. monitoring Oct 20, 2025, 01:41 PM UTC

    We were experiencing increased errors in both the UI and API in the US hosted environment due to a credentials error caused by the ongoing AWS outage. We have implemented a fix and are seeing a decreased rate of errors in both the API and UI.

  7. monitoring Oct 20, 2025, 04:35 PM UTC

    The AWS outage in US East (N. Virginia) is still ongoing. AWS has identified the root cause to be an internal networking issue and has throttled requests for new EC2 instances. This is currently causing potential outages to the following Etleap components:

    - Pipelines – may become latent, as EMR may fail to scale up for increases in demand.
    - dbt Schedules – may become latent, as EMR may fail to scale up for increases in demand.
    - Event Streams – may fail to read from sources, as the autoscaling group that serves these connections fails to provision new instances.

    Our API and UI components are currently fully operational.

  8. monitoring Oct 20, 2025, 05:41 PM UTC

    We were able to recover some capacity for our Event Stream endpoints; they are currently able to receive data, but request latencies are still high. We are working to provision extra capacity.

  9. monitoring Oct 20, 2025, 05:42 PM UTC

    We are continuing to monitor for any further issues.

  10. monitoring Oct 20, 2025, 05:46 PM UTC

    We are seeing pipeline and dbt schedule latencies recovering; we are continuing to monitor the overall recovery.

  11. monitoring Oct 20, 2025, 05:55 PM UTC

    We are seeing increased connectivity issues with our event streaming endpoints due to networking issues.

  12. monitoring Oct 20, 2025, 06:17 PM UTC

    We were able to provision extra capacity for our event streaming endpoints and are seeing a reduction in error rates; we are continuing to monitor the situation.

  13. monitoring Oct 20, 2025, 07:43 PM UTC

    Event stream ingestion has fully recovered; we are continuing to monitor pipeline recovery.

  14. monitoring Oct 20, 2025, 08:13 PM UTC

    We are seeing connectivity issues affecting our streaming endpoints; We are investigating the root cause.

  15. monitoring Oct 20, 2025, 08:36 PM UTC

    We have addressed capacity and networking issues for our streaming ingest endpoints.

  16. resolved Oct 20, 2025, 09:43 PM UTC

    Most pipelines and dbt schedules have recovered. For any remaining issues, Etleap Support has reached out directly to affected customers, and we are working on addressing the last remaining issues. For private deployments, Etleap Support has reached out if any remedial steps are required.

  17. postmortem Oct 21, 2025, 09:27 PM UTC

    On October 20, 2025, between 07:13 UTC and 22:25 UTC, we experienced a disruption affecting multiple services due to the widespread outage in the AWS `us-east-1` region, which our US deployment runs in.

    Pipeline Operations

    From 07:13 UTC to 09:21 UTC, pipeline activities were unavailable due to outages in several dependent AWS services, including DynamoDB, SQS, SNS, and Glue. Between 07:30 UTC and 08:28:48 UTC, we were unable to send SNS notifications for completed activities. Beginning at 09:21 UTC, new activities could be initiated; however, recovery was delayed because EC2 instance provisioning was throttled, limiting our ability to restore capacity promptly. Full recovery of all pipeline activities was achieved by 18:40 UTC.

    Throughout the day, we observed certain source and destination connections either failing to be extracted from or failing to be loaded to due to their own use of AWS infrastructure. For more information, we recommend reviewing the status pages of these third parties.

    Webhook/Event Stream Services

    Between 08:20 UTC and 17:35 UTC, webhook (event stream) endpoints were unable to receive events. This was caused by insufficient EC2 capacity and networking issues between our load balancer and EC2 targets. Recovery began at 17:35 UTC, with intermittent connectivity issues persisting until 22:25 UTC, at which point all services and capacity were fully restored. During this period, some webhook calls were answered with a 502 response. Depending on how the sender is configured, these may have been retried until the events were processed successfully.

    Customer Impact

    During the outage, the webhook endpoints returned HTTP 502 responses. Any messages that received this response should be retried. Any SNS notifications for activities completed between 07:30 UTC and 08:28:48 UTC were not sent.
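The Customer Impact section advises that any webhook message answered with an HTTP 502 should be retried. A minimal sketch of that sender-side behavior, with exponential backoff, might look like the following. `send_with_retry` and its parameters are illustrative, not part of Etleap's API; `send` stands in for whatever callable delivers the payload and returns an HTTP status code.

```python
import time

def send_with_retry(send, payload, max_attempts=5, base_delay=1.0):
    """Retry `send(payload)` while it returns HTTP 502, with exponential backoff.

    `send` is any callable that delivers the payload and returns an HTTP
    status code. Names and parameters here are hypothetical, for illustration.
    """
    status = None
    for attempt in range(max_attempts):
        status = send(payload)
        if status != 502:  # delivered (or failed for a non-retryable reason)
            return status
        if attempt < max_attempts - 1:
            # back off: 1s, 2s, 4s, ... before the next attempt
            time.sleep(base_delay * (2 ** attempt))
    return status
```

Capping attempts and backing off avoids hammering an endpoint that is still recovering from a capacity shortage like the one described above.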