Replicate Outage History

Replicate had 37 outages in the last 2 years, totaling 164h 9m of downtime and averaging 1.5 incidents per month.

The 37 Replicate outages recorded since August 24, 2025 are summarised below, with incident details, duration, and resolution information for each.
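
The headline figures can be sanity-checked with a few lines of arithmetic. This is a rough sketch only: it assumes "the last 2 years" means 24 calendar months, and the variable names are illustrative.

```python
# Figures quoted in the summary above.
incidents = 37
downtime_minutes = 164 * 60 + 9  # 164h 9m

# Assumption: "the last 2 years" is treated as 24 calendar months.
window_months = 24

per_month = incidents / window_months
avg_minutes = downtime_minutes / incidents
avg_hours, avg_rem = divmod(round(avg_minutes), 60)

print(f"{per_month:.1f} incidents per month")
print(f"about {avg_hours}h {avg_rem}m of downtime per incident")
```

Under that assumption the rate matches the quoted 1.5 incidents per month, and the mean downtime works out to roughly four and a half hours per incident.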

Source: https://www.replicatestatus.com

Minor August 2, 2026

API degraded for A100s

Detected by Pingoru: Aug 02, 2026, 04:18 PM UTC
Resolved: Aug 02, 2026, 04:18 PM UTC
Duration: —

Timeline · 2 updates

identified Aug 02, 2026, 04:09 PM UTC

Status: Identified The API for creating predictions for A100s is currently degraded because a piece of backing infrastructure failed. We are working on getting it back online.
resolved Aug 02, 2026, 04:18 PM UTC

Status: Resolved We have restarted the affected component and service should be back to normal. Thank you for your patience
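
The Duration field for this incident is blank, but the timeline carries both endpoints. A minimal sketch of deriving a span from the first posted update and the resolved update follows; the helper names are hypothetical, and the result only measures the time between status posts, which may differ from the actual impact window.

```python
from datetime import datetime, timedelta

# Timestamp layout used on this page, e.g. "Aug 02, 2026, 04:09 PM UTC".
# Note: %b expects English month abbreviations (the default C locale).
STAMP_FORMAT = "%b %d, %Y, %I:%M %p UTC"


def parse_stamp(text: str) -> datetime:
    """Parse one timeline timestamp into a naive datetime (UTC implied)."""
    return datetime.strptime(text, STAMP_FORMAT)


def timeline_span(first_update: str, resolved: str) -> timedelta:
    """Time between the first posted update and the resolved update."""
    return parse_stamp(resolved) - parse_stamp(first_update)


span = timeline_span("Aug 02, 2026, 04:09 PM UTC", "Aug 02, 2026, 04:18 PM UTC")
print(span)  # 0:09:00
```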

Minor July 31, 2026

Degraded scale-out due to failed setups pulling from huggingface

Detected by Pingoru: Jul 31, 2026, 11:39 PM UTC
Resolved: Jul 31, 2026, 11:39 PM UTC
Duration: —

Affected: A100 Hardware, L40S Hardware, H100 Hardware

Timeline · 4 updates

identified Jul 31, 2026, 09:25 PM UTC

Status: Identified Models that pull from huggingface during setup have not been succeeding, which is resulting in scale-out delays and queue backups. Affected components: A100 Hardware (Degraded performance), H100 Hardware (Degraded performance), L40S Hardware (Degraded performance)
identified Jul 31, 2026, 09:26 PM UTC

Status: Identified We have confirmed that turning off our caching layer for huggingface downloads allows models to successfully set up, so we are making that change for all models while we continue to troubleshoot. Affected components: H100 Hardware (Degraded performance), L40S Hardware (Degraded performance), A100 Hardware (Degraded performance)
monitoring Jul 31, 2026, 10:27 PM UTC

Status: Monitoring All affected hardware types have recovered and we are continuing to monitor and troubleshoot. Thanks for your patience! Affected components: H100 Hardware (Operational), L40S Hardware (Operational), A100 Hardware (Operational)
resolved Jul 31, 2026, 11:39 PM UTC

Status: Resolved We're looking stable now, with caching back on. Thanks again for your patience! Affected components: A100 Hardware (Operational), H100 Hardware (Operational), L40S Hardware (Operational)

Minor July 23, 2026

Hitting GPU Capacity for H100s creating large queue times for some models

Detected by Pingoru: Jul 23, 2026, 02:19 PM UTC
Resolved: Jul 23, 2026, 02:19 PM UTC
Duration: —

Affected: H100 Hardware, Official Models

Timeline · 4 updates

identified Jul 22, 2026, 11:51 AM UTC

Status: Identified We are currently over provisioned on H100s, which is causing long queue times for some models running on H100s. Affected components: H100 Hardware (Degraded performance), Official Models (Degraded performance)
monitoring Jul 22, 2026, 05:11 PM UTC

Status: Monitoring GPU usage is back below capacity. We're continuing to monitor. Affected components: H100 Hardware (Degraded performance), Official Models (Degraded performance)
monitoring Jul 23, 2026, 03:05 AM UTC

Status: Monitoring This issue is now resolved. Affected components: H100 Hardware (Operational), Official Models (Operational)
resolved Jul 23, 2026, 02:19 PM UTC

Status: Resolved This issue is resolved. Affected components: H100 Hardware (Operational), Official Models (Operational)

Minor July 16, 2026

HuggingFace download issues

Detected by Pingoru: Jul 16, 2026, 04:04 PM UTC
Resolved: Jul 16, 2026, 04:04 PM UTC
Duration: —

Timeline · 5 updates

investigating Jul 16, 2026, 08:31 AM UTC

Status: Investigating We are aware of 504s being returned by models that reach out to HuggingFace during setup. We believe this is likely related to the disruption visible at https://status.huggingface.co/. We are monitoring and will update once we can confirm the HuggingFace CDN is back up.
monitoring Jul 16, 2026, 09:10 AM UTC

Status: Monitoring HuggingFace is reporting recovery, which we are also seeing on our end. However, we still see minor degradation, which we believe is due to a thundering herd after the main issue was resolved. We are still keeping an eye out, but believe we are recovering.
monitoring Jul 16, 2026, 09:28 AM UTC

Status: Monitoring We have seen most HuggingFace calls start succeeding, except for a small subset that are being downloaded through cas-bridge.xethub.hf.co. Most models are now spinning up successfully.
monitoring Jul 16, 2026, 09:46 AM UTC

Status: Monitoring Still seeing signs of improvement: most models are able to spin up, with a few still waiting their turn from HuggingFace. Models that are failing for HuggingFace-related reasons should spin up after enough tries.
resolved Jul 16, 2026, 04:04 PM UTC

Status: Resolved We have not seen elevated rates of HuggingFace/model setup errors for a couple of hours now, so we believe this incident is cleared.
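
The updates above mention a thundering herd after recovery and note that failing setups should succeed "after enough tries". A common client-side mitigation is exponential backoff with random jitter, so that retries from many clients do not land at the same instant. The sketch below is a generic illustration, not Replicate's or HuggingFace's implementation; `backoff_delay`, `retry_with_jitter`, and the choice of `OSError` as the retryable error are all assumptions.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_with_jitter(fetch: Callable[[], T], max_attempts: int = 5,
                      base: float = 1.0, cap: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until it succeeds, sleeping a jittered delay between tries.

    Assumption: transient download failures surface as OSError; the last
    failure is re-raised once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap))
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in normal use the default `time.sleep` applies.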

Minor July 14, 2026

H100 GPU shortage resulting in high queue times

Detected by Pingoru: Jul 14, 2026, 05:27 PM UTC
Resolved: Jul 14, 2026, 05:27 PM UTC
Duration: —

Affected: H100 Hardware

Timeline · 3 updates

investigating Jul 13, 2026, 04:26 PM UTC

Status: Investigating Long queue times, especially on BFL models. Affected components: H100 Hardware (Degraded performance)
monitoring Jul 13, 2026, 05:48 PM UTC

Status: Monitoring We pushed a fix and are monitoring. Affected components: H100 Hardware (Degraded performance)
resolved Jul 14, 2026, 05:27 PM UTC

Status: Resolved H100 capacity has returned to normal levels. Affected components: H100 Hardware (Operational)

Minor July 10, 2026

High contention on H100 hardware

Detected by Pingoru: Jul 10, 2026, 11:04 PM UTC
Resolved: Jul 10, 2026, 11:04 PM UTC
Duration: —

Affected: H100 Hardware

Timeline · 2 updates

monitoring Jul 10, 2026, 02:50 PM UTC

Status: Monitoring We are seeing high contention on H100 hardware, which is resulting in delays on predictions and scale-out. Affected components: H100 Hardware (Degraded performance)
resolved Jul 10, 2026, 11:04 PM UTC

Status: Resolved We are back under max capacity for H100 hardware. Thank you for your patience! Affected components: H100 Hardware (Operational)

Minor June 30, 2026

Limited H100 capacity

Detected by Pingoru: Jun 30, 2026, 07:29 PM UTC
Resolved: Jun 30, 2026, 07:29 PM UTC
Duration: —

Affected: H100 Hardware

Timeline · 2 updates

identified Jun 30, 2026, 01:55 PM UTC

Status: Identified We recently received a sharp increase in demand for H100 capacity, which is resulting in delayed scale-out and queue backup. Affected components: H100 Hardware (Degraded performance)
resolved Jun 30, 2026, 07:29 PM UTC

Status: Resolved The H100 capacity is back in good health. Thanks for your patience! Affected components: H100 Hardware (Operational)

Minor June 3, 2026

We're seeing long setup times and high contention for models on some L40S and H200 clusters.

Detected by Pingoru: Jun 03, 2026, 07:06 PM UTC
Resolved: Jun 03, 2026, 07:06 PM UTC
Duration: —

Affected: L40S Hardware, H100 Hardware

Timeline · 3 updates

investigating Jun 03, 2026, 05:59 PM UTC

Status: Investigating We're seeing long setup times and high contention for models on some L40S and H200 clusters. Affected components: H100 Hardware (Partial outage), L40S Hardware (Partial outage)
investigating Jun 03, 2026, 07:06 PM UTC

Status: Investigating System is back to operating normally. Affected components: H100 Hardware (Partial outage), L40S Hardware (Partial outage)
resolved Jun 03, 2026, 07:06 PM UTC

Status: Resolved System is back to operating normally. Affected components: H100 Hardware (Operational), L40S Hardware (Operational)

Minor May 28, 2026

Degraded performance on flux-2-klein-4b

Detected by Pingoru: May 28, 2026, 02:17 PM UTC
Resolved: May 28, 2026, 02:17 PM UTC
Duration: —

Affected: Official Models

Timeline · 2 updates

investigating May 28, 2026, 12:30 PM UTC

Status: Investigating Long queue times for black-forest-labs/flux-2-klein-4b, resulting in canceled predictions. Affected components: Official Models (Degraded performance)
resolved May 28, 2026, 02:17 PM UTC

Status: Resolved This issue has been resolved and queue times are back to normal. Affected components: Official Models (Operational)

Minor May 21, 2026

Prediction and Training status updates delayed

Detected by Pingoru: May 21, 2026, 11:28 PM UTC
Resolved: May 21, 2026, 11:28 PM UTC
Duration: —

Affected: Streaming API, HTTP API, CPU Hardware, A100 Hardware, Playground, Home Page, L40S Hardware, H100 Hardware, T4 Hardware

Timeline · 2 updates

identified May 21, 2026, 09:38 PM UTC

Status: Identified Our message queues for prediction and training status updates are hitting capacity limits, which is causing connection failures for queue consumers. We are in the process of bringing additional capacity online. Affected components: Playground (Degraded performance), H100 Hardware (Degraded performance), HTTP API (Degraded performance), T4 Hardware (Degraded performance), CPU Hardware (Degraded performance), Home Page (Degraded performance), L40S Hardware (Degraded performance), A100 Hardware (Degraded performance), Streaming API (Degraded performance)
resolved May 21, 2026, 11:28 PM UTC

Status: Resolved Message flows are healthy. Affected components: L40S Hardware (Operational), A100 Hardware (Operational), Streaming API (Operational), HTTP API (Operational), T4 Hardware (Operational), CPU Hardware (Operational), Home Page (Operational), Playground (Operational), H100 Hardware (Operational)

Minor May 21, 2026

Constrained H100 capacity

Detected by Pingoru: May 21, 2026, 10:05 PM UTC
Resolved: May 21, 2026, 10:05 PM UTC
Duration: —

Affected: H100 Hardware

Timeline · 2 updates

identified May 21, 2026, 03:09 PM UTC

Status: Identified We are seeing heightened demand for H100 hardware, which is causing severe queue delays. Affected components: H100 Hardware (Partial outage)
resolved May 21, 2026, 10:05 PM UTC

Status: Resolved H100 hardware contention has resolved. Thank you for your patience! Affected components: H100 Hardware (Operational)

Minor May 12, 2026

Constrained capacity for H100 hardware

Detected by Pingoru: May 12, 2026, 07:43 PM UTC
Resolved: May 12, 2026, 07:43 PM UTC
Duration: —

Affected: H100 Hardware

Timeline · 2 updates

identified May 12, 2026, 03:27 PM UTC

Status: Identified Demand for constrained H100 hardware is causing scaling delays. This impacts queue size and inference speed for any models running on H100s. Affected components: H100 Hardware (Degraded performance)
resolved May 12, 2026, 07:43 PM UTC

Status: Resolved There is no more contention for H100 hardware. Thank you for your patience! Affected components: H100 Hardware (Operational)

Minor April 19, 2026

Degraded A100 hardware

Detected by Pingoru: Apr 19, 2026, 03:02 PM UTC
Resolved: Apr 19, 2026, 03:02 PM UTC
Duration: —

Affected: A100 Hardware

Timeline · 2 updates

monitoring Apr 19, 2026, 02:36 PM UTC

Status: Monitoring All predictions and trainings targeting A100 hardware are experiencing degraded performance while control plane nodes restart. Affected components: A100 Hardware (Degraded performance)
resolved Apr 19, 2026, 03:02 PM UTC

Status: Resolved All A100 capacity is back. Thanks for your patience! Affected components: A100 Hardware (Operational)

Minor April 9, 2026

A100 capacity unavailable during storage maintenance

Detected by Pingoru: Apr 09, 2026, 05:30 PM UTC
Resolved: Apr 09, 2026, 05:30 PM UTC
Duration: —

Affected: A100 Hardware

Timeline · 2 updates

investigating Apr 09, 2026, 04:43 PM UTC

Status: Investigating The persistent storage for all A100 hardware is under maintenance and is expected to be degraded until completion. Affected components: A100 Hardware (Partial outage)
resolved Apr 09, 2026, 05:30 PM UTC

Status: Resolved The maintenance is complete and all systems are reporting healthy. Thank you for your patience! Affected components: A100 Hardware (Operational)

Minor March 24, 2026

Downstream errors for Black Forest Labs models

Detected by Pingoru: Mar 24, 2026, 01:34 AM UTC
Resolved: Mar 24, 2026, 01:34 AM UTC
Duration: —

Affected: Official Models

Timeline · 2 updates

identified Mar 23, 2026, 03:57 PM UTC

Status: Identified Some Black Forest Labs models are failing due to downstream errors from BFL. We are monitoring the situation and working on workarounds. BFL status page: https://status.bfl.ml/ Affected components: Official Models (Degraded performance)
resolved Mar 24, 2026, 01:34 AM UTC

Status: Resolved Black Forest Labs has resolved the issue. Affected components: Official Models (Operational)

Minor March 10, 2026

Degraded performance on Flux Schnell

Detected by Pingoru: Mar 10, 2026, 06:17 PM UTC
Resolved: Mar 10, 2026, 06:17 PM UTC
Duration: —

Timeline · 3 updates

investigating Mar 10, 2026, 12:56 PM UTC

Status: Investigating We are investigating an outage that is only affecting Flux Schnell.
monitoring Mar 10, 2026, 01:11 PM UTC

Status: Monitoring A GPU provider has an outage. Traffic is being rerouted and we are processing new Flux Schnell requests.
resolved Mar 10, 2026, 06:17 PM UTC

Status: Resolved Flux Schnell requests are being served normally.

Minor February 20, 2026

Model Predictions Stuck at "Starting"

Detected by Pingoru: Feb 20, 2026, 01:12 PM UTC
Resolved: Feb 20, 2026, 01:12 PM UTC
Duration: —

Affected: Streaming API, HTTP API, Official Models

Timeline · 3 updates

investigating Feb 20, 2026, 12:13 PM UTC

Status: Investigating We are investigating why a large number of models are not processing requests, with predictions stalled in a "starting" status. Affected components: Streaming API (Partial outage), HTTP API (Partial outage), Official Models (Partial outage)
monitoring Feb 20, 2026, 12:59 PM UTC

Status: Monitoring We have identified the root cause and have made an update. We are continuing to monitor as we start to see things improve. Affected components: Streaming API (Degraded performance), HTTP API (Degraded performance), Official Models (Degraded performance)
resolved Feb 20, 2026, 01:12 PM UTC

Status: Resolved Models are once again operational. Affected components: Streaming API (Operational), HTTP API (Operational), Official Models (Operational)