Castle incident

Service disruption

Minor Resolved View vendor source →

Castle experienced a minor incident on April 4, 2021 affecting Legacy APIs, lasting 29m. The incident has been resolved; the full update timeline is below.

Started
Apr 04, 2021, 01:56 PM UTC
Resolved
Apr 04, 2021, 02:26 PM UTC
Duration
29m
Detected by Pingoru
Apr 04, 2021, 01:56 PM UTC

Affected components

Legacy APIs

Update timeline

  1. investigating Apr 04, 2021, 02:06 PM UTC

    We’re experiencing timeouts in API endpoints. Investigating

  2. investigating Apr 04, 2021, 02:13 PM UTC

    API endpoints responding normally again. Queued requests are catching up. Monitoring

  3. resolved Apr 04, 2021, 02:26 PM UTC

    System is back to normal. We will follow up with more details about this incident

  4. postmortem Apr 08, 2021, 02:28 PM UTC

    On Sunday, April 4th, 2021, beginning at 13:56 UTC, Castle's `/authenticate` endpoint was unavailable. Our teams promptly responded and service was restored at 14:09 UTC. We've conducted a full retrospective and root-cause analysis and determined that the original cause of the incident was the hardware failure \(as confirmed by AWS Support\) of an AWS host instance that contained Castle's managed cache service. This hardware failure caused an accumulation of timeouts, resulting in some app instances being marked unhealthy and automatically restarted in a loop. Although rare, we do expect occasional hardware-level failures, and our system is designed to be resilient to these failures whenever possible. In this case, the accumulated timeouts caused the system to behave in a way we have not seen before. We have re-prioritized our engineering team to implement '[circuit breaker](https://martinfowler.com/bliki/CircuitBreaker.html)'-style handling around cache look-ups which will prevent subsequent cache layer failures from impacting synchronous endpoints like `/authenticate`.