pool.ntp.org experienced a major incident on July 3, 2025, affecting the Management Portal, the Public website, and 1 more component, lasting 9h 7m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Jul 03, 2025, 02:15 PM UTC
A couple of days ago I upgraded the older MySQL cluster to a newer one. It worked fine and then ... didn't. The DNS and NTP services continue to operate, but the management website and monitoring are having a complete outage.
- identified Jul 03, 2025, 02:16 PM UTC
The MySQL cluster is being reset and a backup from a couple of hours ago is being restored.
- monitoring Jul 03, 2025, 06:07 PM UTC
Database has been restored; monitoring the performance.
- resolved Jul 03, 2025, 11:22 PM UTC
This incident has been resolved.
- postmortem Jul 03, 2025, 11:22 PM UTC
I moved the production database to a new version of MySQL earlier in the week (using the same instance the beta system has been using since sometime last year). There were some minor hiccups in the process, but I left it in what I thought was a stable, happy place. Very early this morning, California time, the database cluster went into read-only mode. I couldn't get it back in sync when I woke up and saw it. I decided the best course of action was to clear the cluster and restore the most recent backup (from a few hours prior). I deleted the cluster and pointed to the backup, but the (open source) tool to restore the backup had a bug that made the restore fail, and it took me a little while to learn enough about it to add debugging, deploy a custom build, and then fix the bug. Ooof. It was down for almost 6 hours by the time it was up and working again. The major database upgrade is now complete, though, so hopefully there won't be something like it until maybe converting to Postgres at some point in the future.