Stanford University Outage History

Major April 19, 2026

Outage detected on status page

Detected by Pingoru: Apr 19, 2026, 08:44 PM UTC
Resolved: Apr 20, 2026, 03:31 PM UTC
Duration: 18h 47m

Affected: Overall

Minor February 5, 2026

Investigating potential scheduling delays

Detected by Pingoru: Feb 05, 2026, 08:40 PM UTC
Resolved: Feb 20, 2026, 01:26 AM UTC
Duration: 14d 4h

Affected: Slurm controller

Timeline · 5 updates

investigating Feb 05, 2026, 08:40 PM UTC

We’re currently investigating scheduling delays affecting jobs recently submitted on Sherlock. Under certain circumstances, jobs may wait in queue longer than usual before being dispatched. All jobs will eventually start, so we recommend keeping them in queue rather than cancelling them (re-submitting them later will only put them back at the end of the line). We’re working with the scheduler support and development teams on this incident, and will post updates when we have them.
investigating Feb 07, 2026, 01:18 AM UTC

The scheduling delays are still being investigated. As mentioned initially, all jobs eventually get executed, so no action is required on the user’s part, besides a little more patience than usual. We’re aware of the trouble this may cause, and are working with the scheduler developers to identify the problem and find a path to resolution.
identified Feb 11, 2026, 01:29 AM UTC

Work continues with the scheduler developers on this issue, and good progress is being made. A likely source of the scheduling delays has been identified, and we are now working on validating possible workarounds, before a fix can be developed, tested and deployed. As a reminder, all jobs will eventually start, so no action is required on your part. We appreciate your patience and will continue to post updates as we approach final resolution.
monitoring Feb 17, 2026, 08:56 PM UTC

The root cause of the potential scheduling delays reported earlier has been identified as a bug that caused the job scheduler to make inefficient decisions on systems where many jobs request licenses (like Sherlock), resulting in jobs waiting longer than expected to start. The workaround currently in place has been validated, and scheduling is back to normal: no further delays are being observed. We are keeping this issue open until an official fix is released upstream and deployed on Sherlock.
resolved Feb 20, 2026, 01:26 AM UTC

A fix addressing the root cause of the scheduling delays has been deployed. Job dispatch times have returned to normal, and the issue is now resolved. We appreciate users’ patience while we worked with the Slurm development team to identify and address the problem.

Minor July 22, 2025

`/scratch` file system is unresponsive

Detected by Pingoru: Jul 22, 2025, 09:07 PM UTC
Resolved: Jul 22, 2025, 11:37 PM UTC
Duration: 2h 30m

Affected: $SCRATCH

Timeline · 4 updates

investigating Jul 22, 2025, 09:07 PM UTC

The /scratch filesystem (which serves $SCRATCH and $GROUP_SCRATCH) is experiencing some issues. Symptoms include hanging commands and unresponsive access to anything under /scratch. We’re currently investigating, and we’ll post updates as they become available.
investigating Jul 22, 2025, 09:48 PM UTC

We’re still investigating the issue, and are working on restoring access to /scratch as soon as possible.
monitoring Jul 22, 2025, 10:51 PM UTC

The /scratch file system should be back up and running normally. Processes that were stuck on I/O should have resumed automatically, but in case applications reported explicit errors, feel free to resubmit those jobs or restart those processes, and reach out to [email protected] if you have any questions.
resolved Jul 22, 2025, 11:37 PM UTC

The issue has been resolved. We’ll keep an eye on things, but we’re confident this incident can be closed now.
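
The "Duration" value listed for each incident on this page can be checked from its "Detected" and "Resolved" timestamps. The following is a minimal sketch, assuming the timestamps keep the layout shown here (for example `Feb 05, 2026, 08:40 PM`, in UTC) and an English locale for month names. The helper names `outage_duration` and `format_duration` are hypothetical and are not part of any status-page tooling.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, copied from this page: "Feb 05, 2026, 08:40 PM" (UTC).
FMT = "%b %d, %Y, %I:%M %p"


def outage_duration(detected: str, resolved: str) -> timedelta:
    """Return the elapsed time between two status-page timestamps."""
    return datetime.strptime(resolved, FMT) - datetime.strptime(detected, FMT)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as days, hours and minutes, skipping leading zero units."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    parts = []
    if days:
        parts.append(f"{days}d")
    if days or hours:
        parts.append(f"{hours}h")
    parts.append(f"{minutes}m")
    return " ".join(parts)


if __name__ == "__main__":
    incidents = [
        ("Apr 19, 2026, 08:44 PM", "Apr 20, 2026, 03:31 PM"),
        ("Feb 05, 2026, 08:40 PM", "Feb 20, 2026, 01:26 AM"),
        ("Jul 22, 2025, 09:07 PM", "Jul 22, 2025, 11:37 PM"),
    ]
    for detected, resolved in incidents:
        print(detected, "->", format_duration(outage_duration(detected, resolved)))
```

For the February 2026 incident this sketch reports minutes as well, whereas the page shows the duration rounded down to whole hours (14d 4h).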