HUIT incident

Open OnDemand Service Interruption

Notice Resolved

HUIT experienced a notice-level incident on April 16, 2025, affecting Other Services and lasting 16h 33m. The incident has been resolved; the full update timeline is below.

Started
Apr 16, 2025, 09:12 PM UTC
Resolved
Apr 17, 2025, 01:46 PM UTC
Duration
16h 33m
Detected by Pingoru
Apr 16, 2025, 09:12 PM UTC

Affected components

Other Services

Update timeline

  1. investigating Apr 16, 2025, 09:12 PM UTC

    When logging in to https://ood.huit.harvard.edu, users can load the Open OnDemand dashboard, but interactive apps will not start properly. The terminal app may load, but the Slurm scheduler is unstable, and compute jobs may or may not run. User data is unaffected and can still be downloaded through the Open OnDemand dashboard. FAS Academic Technology is troubleshooting this issue and working to restore the service.

  2. identified Apr 16, 2025, 10:45 PM UTC

    The service team believes they've identified a path forward. They continue working to investigate and remediate the root cause.

  3. monitoring Apr 16, 2025, 11:44 PM UTC

    Restarting the Slurm controller node and altering its configuration has made it possible to launch interactive apps in HUIT Open OnDemand once again. HUIT will continue to monitor the service to ensure stability before resolving the Major Incident.

  4. resolved Apr 17, 2025, 01:46 PM UTC

    The outage affecting HUIT Open OnDemand has been resolved, and users are able to log in and access resources. We'll continue to monitor the service closely to ensure stability.