pool.ntp.org incident

manage.ntppool.org outage


pool.ntp.org experienced a major incident on May 8, 2021 affecting the Management Portal, the Public website, and DNS updates, lasting 1d 15h. The incident has been resolved; the full update timeline is below.

Started
May 08, 2021, 08:00 AM UTC
Resolved
May 09, 2021, 11:10 PM UTC
Duration
1d 15h
Detected by Pingoru
May 08, 2021, 08:00 AM UTC

Affected components

Management Portal, Public website, DNS updates

Update timeline

  1. investigating May 08, 2021, 05:37 PM UTC

    The Ceph storage system hung this morning; it is slowly recovering (hopefully -- it is still a bit choppy). Monitoring isn't working, but all DNS services and the NTP service are unaffected or only minimally impacted for now.

  2. identified May 09, 2021, 05:25 AM UTC

    The system has been up and down through the day. We continue to have trouble with the Ceph system locking up. :-/

  3. monitoring May 09, 2021, 09:31 PM UTC

    The system should be stable again. It took a while (to put it mildly), but we appear to have tracked the trouble down to a recent `runc` upgrade that would sometimes get stuck when starting containers. The trouble we had with Ceph was just a side effect of this. In the process of debugging all this, the Ceph configuration has been made much more durable. 🤞🏻

  4. resolved May 09, 2021, 11:10 PM UTC

    All has been stable since we downgraded `runc`. We think it was this issue -- https://github.com/opencontainers/runc/issues/2865