4Schools incident

CMS4Schools sites unavailable from some locations

Major Resolved

4Schools experienced a major incident on January 19, 2023 affecting the Application component, lasting 2h 33m. The incident has been resolved; the full update timeline is below.

Started
Jan 19, 2023, 03:08 PM UTC
Resolved
Jan 19, 2023, 05:41 PM UTC
Duration
2h 33m
Detected by Pingoru
Jan 19, 2023, 03:08 PM UTC

Affected components

Application

Update timeline

  1. investigating Jan 19, 2023, 03:08 PM UTC

    We are receiving reports that CMS4Schools websites are unavailable for some visitors. The issue is not affecting all users and we are investigating to establish who is affected and what the cause may be.

  2. identified Jan 19, 2023, 03:20 PM UTC

    We have identified a server that is causing these problems, have put a temporary fix in place, and are continuing to work toward a proper solution.

  3. resolved Jan 19, 2023, 05:41 PM UTC

    ✅ Our CMS4Schools servers are back in their optimal configuration, and the problem has been resolved.

    🔎 If you're interested, here are some nerdy details about what happened:

    At about 8:35 AM CT, a configuration update was deployed to CMS4Schools servers. This update included an increase in the number of sites to be hosted on those servers. That increase pushed the HTTP server on one of them beyond a filesystem limit enforced by the operating system. As a result, that server stopped serving web content and started returning one of a few different error messages.

    Because the error response was inconsistent, the failure didn't affect all of our visitors or monitoring systems in the same way. Some visitors were unaffected, and some were getting the same error message every time.

    Once it was clear that this issue was affecting visitors, and was not just an internal issue, we opened this Incident at 9:08 AM CT. At 9:18 AM CT we identified the failed server and removed it from operation. All customer sites saw immediate relief from the issue.

    After diagnosing and addressing the filesystem limit, we were able to put that server back in operation at 11:05 AM CT. Monitors show normal traffic from that server to our visitors.

    To address the root cause, we will remove a legacy logging mechanism that was causing us to approach the filesystem limit unnecessarily. We will also configure smarter alerts that can help us spot this kind of issue more quickly.
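
    For readers curious what a "smarter alert" for this kind of failure might look like, here is a minimal sketch. It is not our actual monitoring. The update above does not say which filesystem limit was hit, so the sketch assumes, purely for illustration, that it was the per-process limit on open file descriptors. The function names and the 80% threshold are hypothetical.

```python
import os
import resource


def fd_usage(open_count: int, soft_limit: int) -> float:
    """Return the fraction of the descriptor limit in use (0.0 if unlimited)."""
    # RLIM_INFINITY (or any non-positive value) is treated as "no limit".
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return 0.0
    return open_count / soft_limit


def should_alert(open_count: int, soft_limit: int, threshold: float = 0.8) -> bool:
    """Return True once usage reaches the (hypothetical) alert threshold."""
    return fd_usage(open_count, soft_limit) >= threshold


def current_process_usage():
    """Return (open descriptors, soft limit) for this process. Linux only."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # On Linux, each entry in /proc/self/fd is one open descriptor.
    open_count = len(os.listdir("/proc/self/fd"))
    return open_count, soft_limit


if __name__ == "__main__":
    if os.path.isdir("/proc/self/fd"):
        used, limit = current_process_usage()
        print(f"open descriptors: {used} of {limit}")
        print("alert" if should_alert(used, limit) else "ok")
```

    The idea is to warn while there is still headroom, rather than after the server has already started returning errors. A real check would inspect the HTTP server's process rather than its own.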