Apify incident

Major systems outage

Apify experienced a critical incident on February 22, 2025 affecting Web (apify.com), Datacenter, and other components, lasting 1h 26m. The incident has been resolved; the full update timeline is below.

Started: Feb 22, 2025, 10:12 PM UTC
Resolved: Feb 22, 2025, 11:38 PM UTC
Duration: 1h 26m
Detected by Pingoru: Feb 22, 2025, 10:12 PM UTC

Affected components

Web (apify.com), Datacenter, Dataset, Console (console.apify.com), Request queue, Residential, API (api.apify.com), Key-value store, SERP, Actors

Update timeline

investigating Feb 22, 2025, 10:12 PM UTC

We are investigating the issue.
investigating Feb 22, 2025, 10:13 PM UTC

We are continuing to investigate this issue.
identified Feb 22, 2025, 11:05 PM UTC

We've identified the cause of the issue and are taking steps to resolve it.
monitoring Feb 22, 2025, 11:25 PM UTC

A fix has been implemented and we're monitoring the results. It might take some time to allocate all jobs and for all services to fully recover. We're sorry for the inconvenience. We will implement new measures based on this critical incident to prevent similar incidents in the future.
monitoring Feb 22, 2025, 11:33 PM UTC

We are continuing to monitor for any further issues.
resolved Feb 22, 2025, 11:38 PM UTC

This incident has been resolved.
postmortem Feb 26, 2025, 02:21 PM UTC

Saturday outage post-mortem

Date of incident: 2025-02-22

Impact:
* Complete downtime of the Apify platform from 22:00 UTC to 23:00 UTC.
* No Actors could be started.
* Actors that were already running were paused and resumed after the incident was resolved.
* There was no data loss, but workloads were disrupted.

What happened:
* Our user base is growing fast, and the number of new signups is increasing exponentially. The increased load revealed inefficiencies that had not manifested before.
* One misconfigured query then caused a spike in memory usage by loading a suboptimal index, which caused the cluster to crash.

What we did:
* The cluster size was increased to ensure enough memory until the issue was fixed.
* Then, the cluster and all the dependent systems were restarted.

Next steps:
* We are already re-architecting our primary database clusters to optimize systems for future growth.
* As part of this, we aim to decrease the size of all our clusters to ensure performant restarts in case of a problem.
* We are currently optimizing indexing and querying across heavy-load components.
* In addition, we identified metrics we were missing that would warn of similar problems ahead of time.

We sincerely apologize for the disruption and appreciate your patience. If you have any questions, please reach out to [[email protected]](mailto:[email protected]).

Sincerely,
Apify engineering team