pool.ntp.org experienced a major incident on May 8, 2021 affecting Management Portal and Public website and 1 more component, lasting 1d 15h. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating May 08, 2021, 05:37 PM UTC
The Ceph storage system hung this morning; the system is slowly recovering (hopefully -- it is still a bit choppy). Monitoring isn't working, but all DNS services and the NTP service are unaffected or only minimally impacted for now.
- identified May 09, 2021, 05:25 AM UTC
The system has been up and down through the day. We continue to have trouble with the Ceph system locking up. :-/
- monitoring May 09, 2021, 09:31 PM UTC
The system should be stable again. It took a while (to put it mildly), but we appear to have tracked the trouble down to a recent `runc` upgrade that'd sometimes get stuck when starting containers. The trouble we had with Ceph was just a side effect of this. In the process of debugging all this, the Ceph configuration has been made much more durable. 🤞🏻
- resolved May 09, 2021, 11:10 PM UTC
All has been stable since we downgraded `runc`. We think it was this issue -- https://github.com/opencontainers/runc/issues/2865