Northflank incident

Infrastructure stability issues

Minor · Resolved

Northflank experienced a minor incident on May 21, 2025, lasting 1 hour and 10 minutes. The incident has been resolved; the full update timeline is below.

Started
May 21, 2025, 04:04 PM UTC
Resolved
May 21, 2025, 04:04 PM UTC
Duration
Detected by Pingoru
May 21, 2025, 04:04 PM UTC

Update timeline

  1. resolved May 21, 2025, 04:04 PM UTC

    Type: Incident

    Duration: 1 hour and 10 minutes

    Affected Components: Networking, Addons, Services

    * May 21, 16:04:00 GMT+0 - Monitoring
    * May 21, 17:13:33 GMT+0 - Resolved

    **We sincerely apologise for the outage you experienced.**

    **What happened:**

    Yesterday, we rolled out an update to support ARM architecture across our platform. As part of that, a migration was initiated today to extend support to x86 architecture within our database schemas. This triggered a higher-than-expected volume of database rescheduling events, which in turn put excessive pressure on the underlying GCP infrastructure. Some nodes failed under this load, compounding the issue and causing degraded service for several customers.

    **Timeline of events (UTC):**

    * **15:36**: Northflank engineers were alerted to failing health checks on customer endpoints.
    * **15:41**: Elevated 503 errors were detected from a specific cluster.
    * **15:42**: An internal incident was formally declared.
    * **15:57**: Replacement infrastructure began provisioning to recover failed nodes.
    * **16:32**: All customer addons returned to a healthy state, with the exception of four databases which required manual intervention.

    Thank you for your patience. We know how critical uptime is, and we are reviewing and improving our systems to ensure this doesn’t happen again.

    Will Stewart, CEO @ Northflank