DRMtoday incident

DRMtoday Production: Region failure ap-northeast-1

Major Resolved

DRMtoday experienced a major incident on September 25, 2021 affecting License Delivery, lasting 58m. The incident has been resolved; the full update timeline is below.

Started: Sep 25, 2021, 12:21 PM UTC
Resolved: Sep 25, 2021, 01:20 PM UTC
Duration: 58m
Detected by Pingoru: Sep 25, 2021, 12:21 PM UTC

Affected components

License Delivery

Update timeline

  1. investigating Sep 25, 2021, 12:21 PM UTC

    We were seeing an interruption of the license delivery service in ap-northeast-1. A DNS failover to all other regions was initiated automatically.

  2. investigating Sep 25, 2021, 12:37 PM UTC

    The failover started at 12:05 UTC, following a sharp increase in requests in ap-northeast-1 around 12:03:50. The region and the whole system recovered automatically and became fully operational again at 12:15. During the failover, we observed degraded performance.

  3. resolved Sep 25, 2021, 01:20 PM UTC

    The failover was caused by a sudden increase in traffic in the ap-northeast-1 region over the course of a few seconds, which overloaded the running instances. New instances were started automatically, but they took longer to start than the traffic took to rise, which triggered the failover. The DNS failover spread the requests across multiple regions, which absorbed the peak with a much smaller increase in load. Once the new instances had started, ap-northeast-1 recovered within the timeframe given above. As a countermeasure, we increased the number of instances running in ap-northeast-1, which now handle the peaks without degraded performance. A note on the previous status message: we corrected the recovery time. We actually recovered at 12:15 UTC, not 14:15 (timezone error).
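
The mechanism described in the resolution note (traffic rising faster than new instances can start) can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of DRMtoday's actual autoscaling setup: the function name, the per-instance capacity, the startup delay, and the traffic figures are all hypothetical and chosen purely for illustration.

```python
def overloaded_seconds(traffic, running, per_instance_capacity, startup_delay):
    """Return the seconds (indices into `traffic`) at which demand exceeds capacity.

    Hypothetical model: a scale-out is requested as soon as demand exceeds
    capacity, but new instances only start serving `startup_delay` seconds later.
    """
    overloaded = []
    pending = []  # (ready_at_second, instance_count) for instances still booting
    for second, demand in enumerate(traffic):
        # Instances whose startup has finished join the serving pool.
        running += sum(count for ready_at, count in pending if ready_at <= second)
        pending = [(r, c) for r, c in pending if r > second]

        if demand > running * per_instance_capacity:
            overloaded.append(second)
            booting = sum(count for _, count in pending)
            # Ceiling division: instances needed to serve the current demand.
            needed = -(-demand // per_instance_capacity) - running - booting
            if needed > 0:
                pending.append((second + startup_delay, needed))
    return overloaded


# Illustrative numbers only: demand quadruples within one second.
spike = [100] * 3 + [400] * 10

# Two instances: overloaded until the replacements finish booting.
print(overloaded_seconds(spike, running=2, per_instance_capacity=100, startup_delay=5))

# Four instances up front (the kind of countermeasure described above): no overload.
print(overloaded_seconds(spike, running=4, per_instance_capacity=100, startup_delay=5))
```

In this model the overload window is as long as the startup delay, however quickly the scale-out is requested. That is why raising the baseline instance count, rather than reacting faster, removes the overload for a peak of this size.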