DRMtoday incident
DRMtoday Production: Region failure ap-northeast-1
DRMtoday experienced a major incident on September 25, 2021 affecting License Delivery, lasting 58m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Sep 25, 2021, 12:21 PM UTC
We were seeing an interruption of the license delivery service in ap-northeast-1. A DNS failover to all other regions was initiated automatically.
- investigating Sep 25, 2021, 12:37 PM UTC
The failover started at 12:05 UTC, following a sharp increase in requests in ap-northeast-1 around 12:03:50. The region and the whole system recovered automatically and became fully operational again at 12:15. During the failover, we observed degraded performance.
- resolved Sep 25, 2021, 01:20 PM UTC
The failover was caused by a sudden increase in traffic in the ap-northeast-1 region over the course of a few seconds, which overloaded the running instances. New instances were started automatically, but their startup took longer than the traffic ramp-up, which triggered the failover. The DNS failover spread the requests over multiple regions, which absorbed the peak with a much smaller increase in load. Once the new instances had started, ap-northeast-1 recovered within the timeframe given above. As a countermeasure, we increased the number of instances running in ap-northeast-1, which now handle the peaks without degraded performance.
Correction to the previous status message: the recovery time has been corrected. We recovered at 12:15 UTC, not 14:15 (timezone error).
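The reasoning behind the root cause and the countermeasure can be illustrated with a minimal sketch. It is not DRMtoday's actual scaling logic. All function names and all figures (per-instance capacity, peak request rate, instance counts, ramp and startup durations) are hypothetical and chosen only to show why autoscaling alone cannot cover a spike that ramps up faster than instances start, and why keeping more instances running closes that gap.

```python
import math


def is_overloaded(rps: float, running_instances: int, capacity_per_instance_rps: float) -> bool:
    """True if the request rate exceeds what the running instances can serve."""
    return rps > running_instances * capacity_per_instance_rps


def uncovered_seconds(ramp_seconds: float, startup_seconds: float) -> float:
    """Seconds during which the peak has arrived but new instances are still starting."""
    return max(0.0, startup_seconds - ramp_seconds)


def instances_needed(peak_rps: float, capacity_per_instance_rps: float) -> int:
    """Instances that must already be running to absorb the peak without scaling."""
    return math.ceil(peak_rps / capacity_per_instance_rps)


# Hypothetical figures, for illustration only (not taken from the incident).
CAPACITY_RPS = 100.0   # assumed capacity of one instance
PEAK_RPS = 1000.0      # assumed peak request rate
RUNNING_BEFORE = 4     # assumed instance count before the spike

# The spike ramps up in seconds, while instance startup takes minutes.
print(is_overloaded(PEAK_RPS, RUNNING_BEFORE, CAPACITY_RPS))  # True
print(uncovered_seconds(ramp_seconds=5, startup_seconds=120))  # 115.0

# Countermeasure: keep enough instances running to serve the peak directly.
baseline = instances_needed(PEAK_RPS, CAPACITY_RPS)
print(baseline)                                          # 10
print(is_overloaded(PEAK_RPS, baseline, CAPACITY_RPS))   # False
```

With these assumed numbers, the existing instances would be overloaded for roughly the full startup time of the new ones, whereas a higher baseline serves the same peak with no gap.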