Stanford University Outage History
Stanford University is up right now. Stanford University had 3 outages in the last 2 years totaling 343h 17m of downtime, averaging 0.1 incidents per month.
There were 3 Stanford University outages since July 22, 2025 totaling 343h 17m of downtime. Each is summarised below — incident details, duration, and resolution information.
Investigating potential scheduling delays
Timeline · 5 updates
- investigating Feb 05, 2026, 08:40 PM UTC
We’re currently investigating scheduling delays affecting jobs recently submitted on Sherlock. Under certain circumstances, jobs may take longer to be dispatched and may wait in the queue longer than usual. All jobs will eventually start, so we recommend keeping them in the queue and not cancelling them (re-submitting them later will only put them back at the end of the line). We’re working with the scheduler support and development teams on this incident, and will post updates when we have them.
- investigating Feb 07, 2026, 01:18 AM UTC
The scheduling delays are still being investigated. As mentioned initially, all jobs eventually get executed, so no action is required on the user’s part, besides a little more patience than usual. We’re aware of the trouble this may cause, and are working with the scheduler developers to identify the problem and find a path to resolution.
- identified Feb 11, 2026, 01:29 AM UTC
Work continues with the scheduler developers on this issue, and good progress is being made. A likely source of the scheduling delays has been identified, and we are now working on validating possible workarounds, before a fix can be developed, tested and deployed. As a reminder, all jobs will eventually start, so no action is required on your part. We appreciate your patience and will continue to post updates as we approach final resolution.
- monitoring Feb 17, 2026, 08:56 PM UTC
The root cause of the potential scheduling delays reported earlier has been identified as a bug that caused the job scheduler to make inefficient decisions on systems where many jobs request licenses (like Sherlock), resulting in jobs waiting longer than expected to start. The workaround currently in place has been validated, and scheduling is back to normal: no further delays are being observed. We are keeping this issue open until an official fix is released upstream and deployed on Sherlock.
- resolved Feb 20, 2026, 01:26 AM UTC
A fix addressing the root cause of the scheduling delays has been deployed. Job dispatch times have returned to normal, and the issue is now resolved. We appreciate users’ patience while we worked with the Slurm development team to identify and address the problem.
`/scratch` file system is unresponsive
Timeline · 4 updates
- investigating Jul 22, 2025, 09:07 PM UTC
The /scratch filesystem (which serves $SCRATCH and $GROUP_SCRATCH) is experiencing some issues. Symptoms include hanging commands and unresponsive access to anything under /scratch. We’re currently investigating, and we’ll post updates as they become available.
- investigating Jul 22, 2025, 09:48 PM UTC
We’re still investigating the issue, and are working on restoring access to /scratch as soon as possible.
- monitoring Jul 22, 2025, 10:51 PM UTC
The /scratch file system should be back up and running normally. Processes that were stuck on I/O should have resumed automatically, but if any applications reported explicit errors, feel free to resubmit those jobs or restart those processes, and to reach out to [email protected] if you have any questions.
- resolved Jul 22, 2025, 11:37 PM UTC
The issue has been resolved. We’ll keep an eye on things, but we’re confident this incident can be closed now.
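As a rough cross-check of the downtime figure in the summary, the duration of each incident listed above can be computed from its first and last timeline timestamps. The sketch below assumes an incident's downtime runs from its first "investigating" update to its "resolved" update; the dictionary keys and helper names are illustrative and are not part of the status page.

```python
from datetime import datetime, timedelta

# Timestamp format used in the timelines above, e.g. "Feb 05, 2026, 08:40 PM" (UTC).
FMT = "%b %d, %Y, %I:%M %p"

# (first update, resolved update), copied from the two timelines above.
# The labels are illustrative, not official incident identifiers.
INCIDENTS = {
    "scheduling delays": ("Feb 05, 2026, 08:40 PM", "Feb 20, 2026, 01:26 AM"),
    "/scratch unresponsive": ("Jul 22, 2025, 09:07 PM", "Jul 22, 2025, 11:37 PM"),
}


def duration(start: str, end: str) -> timedelta:
    """Elapsed time between two timeline timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def fmt_hm(delta: timedelta) -> str:
    """Render a timedelta as hours and minutes, e.g. '2h 30m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


durations = {name: duration(*span) for name, span in INCIDENTS.items()}
total = sum(durations.values(), timedelta())

for name, delta in durations.items():
    print(f"{name}: {fmt_hm(delta)}")
print(f"total: {fmt_hm(total)}")
# scheduling delays: 340h 46m
# /scratch unresponsive: 2h 30m
# total: 343h 16m
```

Under these assumptions the two listed incidents sum to 343h 16m, one minute short of the reported 343h 17m. The timestamps shown are rounded to the minute, and the summary counts three outages while only two are detailed here, so an exact match should not be expected.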