DRMtoday experienced a minor incident on March 9, 2020 affecting License Delivery, lasting 5h 43m. The incident has been resolved; the full update timeline is below.
Affected components: License Delivery
Update timeline
- investigating Mar 09, 2020, 02:18 PM UTC
We are currently investigating region failovers.
- investigating Mar 09, 2020, 03:46 PM UTC
The service was fully operational again at 14:23 UTC.
- resolved Mar 09, 2020, 08:01 PM UTC
Root cause:
We experienced degraded performance and region failovers in the ap-northeast-1 and us-west-1 regions. The root cause was a single key lookup instance failing at 14:05:07 in ap-northeast-1 during peak time. The remaining instance stayed healthy but could not handle the load, which led to degraded performance and failing requests for all DRM schemes in ap-northeast-1. At 14:06, ap-northeast-1 was considered unhealthy and was automatically deactivated. Requests were then routed to the nearest region, which caused a cascading effect in us-west-1. Autoscaling of key lookup instances started in both regions. As the new instances became healthy, the overall health of both regions recovered at 14:20.
Mitigation:
As an immediate step, we doubled the minimum number of key lookup instances in all regions. We apologize for this incident and the inconvenience.
Timeline (all times UTC):
14:05:07 - Key lookup instance in ap-northeast-1 fails
14:06 - ap-northeast-1 failover starts
14:08 - DRMtoday ops team was alerted
14:11 - us-west-1 failover starts
14:13 - ap-northeast-1 failover ends
14:13 - ap-northeast-1 failover starts
14:15 - us-west-1 failover ends
14:15 - us-west-1 failover starts
14:18 - ap-northeast-1 failover ends
14:20 - us-west-1 failover ends
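Illustration (not part of the original report): the capacity reasoning behind the root cause and the mitigation can be sketched as a simple single-failure headroom check. The function names, the per-instance throughput, and the peak load below are hypothetical values chosen for illustration only; they are not DRMtoday's actual figures or tooling. The only detail taken from the report is that one of two key lookup instances failed and that the minimum instance count was then doubled.

```python
def surviving_capacity(instances: int, per_instance_rps: float, failed: int = 1) -> float:
    """Requests/second the pool can still serve after `failed` instances are lost."""
    return max(instances - failed, 0) * per_instance_rps


def tolerates_failure(instances: int, per_instance_rps: float,
                      peak_rps: float, failed: int = 1) -> bool:
    """True if the remaining instances can absorb the peak load on their own."""
    return surviving_capacity(instances, per_instance_rps, failed) >= peak_rps


if __name__ == "__main__":
    # Hypothetical numbers: each instance serves 100 req/s, peak load is 150 req/s.
    PER_INSTANCE_RPS = 100.0
    PEAK_RPS = 150.0

    # Two instances: losing one leaves 100 req/s of capacity for a 150 req/s peak.
    print("minimum of 2:", tolerates_failure(2, PER_INSTANCE_RPS, PEAK_RPS))

    # Doubled minimum: losing one of four leaves 300 req/s of capacity.
    print("minimum of 4:", tolerates_failure(4, PER_INSTANCE_RPS, PEAK_RPS))
```

Under these assumed numbers, a pool of two instances cannot absorb a single failure at peak, while a pool of four can. The check ignores traffic redirected from a failed-over region, which adds load on top of the local peak.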