Stanford University Outage History

Stanford University is up right now

Stanford University had 3 outages in the last 2 years totaling 343h 17m of downtime — averaging 0.1 incidents per month.

There were 3 Stanford University outages since July 22, 2025 totaling 343h 17m of downtime. Each is summarised below with incident details, duration, and resolution information.

Source: https://status.sherlock.stanford.edu

Minor February 5, 2026

Investigating potential scheduling delays

Detected by Pingoru: Feb 05, 2026, 08:40 PM UTC
Resolved: Feb 20, 2026, 01:26 AM UTC
Duration: 14d 4h
Affected: Slurm controller
Timeline · 5 updates
  1. investigating Feb 05, 2026, 08:40 PM UTC

    We’re currently investigating scheduling delays affecting jobs recently submitted on Sherlock. Under certain circumstances, jobs may take longer to be dispatched and wait in the queue longer than usual. All jobs will eventually start, so we recommend keeping them in the queue rather than cancelling them (re-submitting them later will only put them back at the end of the line). We’re working with the scheduler support and development teams on this incident, and will post updates when we have them.

  2. investigating Feb 07, 2026, 01:18 AM UTC

    The scheduling delays are still being investigated. As mentioned initially, all jobs eventually get executed, so no action is required on the user’s part, besides a little more patience than usual. We’re aware of the trouble this may cause, and are working with the scheduler developers to identify the problem and find a path to resolution.

  3. identified Feb 11, 2026, 01:29 AM UTC

    Work continues with the scheduler developers on this issue, and good progress is being made. A likely source of the scheduling delays has been identified, and we are now working on validating possible workarounds, before a fix can be developed, tested and deployed. As a reminder, all jobs will eventually start, so no action is required on your part. We appreciate your patience and will continue to post updates as we approach final resolution.

  4. monitoring Feb 17, 2026, 08:56 PM UTC

    The root cause of the potential scheduling delays reported earlier has been identified as a bug that caused the job scheduler to make inefficient decisions on systems where many jobs request licenses (like Sherlock), resulting in jobs waiting longer than expected to start. The workaround currently in place has been validated, and scheduling is back to normal: no further delays are being observed. We are keeping this issue open until an official fix is released upstream and deployed on Sherlock.

  5. resolved Feb 20, 2026, 01:26 AM UTC

    A fix addressing the root cause of the scheduling delays has been deployed. Job dispatch times have returned to normal, and the issue is now resolved. We appreciate users’ patience while we worked with the Slurm development team to identify and address the problem.
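The duration reported for this incident follows from the detection and resolution timestamps listed above. The sketch below is a minimal illustration of that arithmetic, assuming both timestamps are UTC and use the display format shown on this page; the function names `parse_stamp` and `format_duration` are hypothetical, and the parsing assumes an English locale for month and AM/PM names.

```python
from datetime import datetime, timezone

# Display format used on this page, e.g. "Feb 05, 2026, 08:40 PM UTC".
# The trailing "UTC" is matched as literal text (assumption: all stamps are UTC).
STAMP_FORMAT = "%b %d, %Y, %I:%M %p UTC"


def parse_stamp(text: str) -> datetime:
    """Parse a page timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(text, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def format_duration(start: datetime, end: datetime) -> str:
    """Render elapsed time as days, hours, and minutes (days omitted when zero)."""
    total_minutes = int((end - start).total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    if days:
        return f"{days}d {hours}h {minutes}m"
    return f"{hours}h {minutes}m"


detected = parse_stamp("Feb 05, 2026, 08:40 PM UTC")
resolved = parse_stamp("Feb 20, 2026, 01:26 AM UTC")
print(format_duration(detected, resolved))
```

For this incident the elapsed time works out to 14 days, 4 hours, and 46 minutes; the "14d 4h" shown on the page appears to drop the minutes for multi-day incidents.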


Minor July 22, 2025

`/scratch` file system is unresponsive

Detected by Pingoru: Jul 22, 2025, 09:07 PM UTC
Resolved: Jul 22, 2025, 11:37 PM UTC
Duration: 2h 30m
Affected: $SCRATCH
Timeline · 4 updates
  1. investigating Jul 22, 2025, 09:07 PM UTC

    The /scratch file system (which serves $SCRATCH and $GROUP_SCRATCH) is experiencing some issues. Symptoms include hanging commands and unresponsive access to anything under /scratch. We’re currently investigating, and we’ll post updates as they become available.

  2. investigating Jul 22, 2025, 09:48 PM UTC

    We’re still investigating the issue, and are working on restoring access to /scratch as soon as possible.

  3. monitoring Jul 22, 2025, 10:51 PM UTC

    The /scratch file system should be back up and running normally. Processes that were stuck on I/O should have resumed automatically, but in case applications reported explicit errors, feel free to resubmit those jobs or restart those processes, and reach out to [email protected] if you have any questions.

  4. resolved Jul 22, 2025, 11:37 PM UTC

    The issue has been resolved. We’ll keep an eye on things, but we’re confident this incident can be closed now.
