Castle incident

API Downtime

Notice Resolved View vendor source →

Castle experienced a notice incident on September 9, 2021, lasting —. The incident has been resolved; the full update timeline is below.

Started
Sep 09, 2021, 02:43 PM UTC
Resolved
Sep 06, 2021, 08:04 PM UTC
Duration
Detected by Pingoru
Sep 09, 2021, 02:43 PM UTC

Update timeline

  1. resolved Sep 09, 2021, 02:43 PM UTC

    At 2021-09-06 20:04 UTC we experienced an AWS hardware failure with one of our main databases which led to 7 minutes of downtime impacting our APIs. During this time, the APIs were returning a 500 response code and no data was processed. The database in question is configured to be multi-node with automatic failover, but for unknown reasons the failover didn't happen as expected when the hardware fault occurred. Instead, a full backup had to be recreated, which led to the extended period of downtime. We're currently debugging this with AWS support to make sure we can trust the resiliency of our platform. While the current setup should provide good redundancy, we're simultaneously looking into alternative options to prevent this from happening again.