DRMtoday incident

DRMtoday Production: Increased latency in us-west-1

Minor Resolved

DRMtoday experienced a minor incident on October 16, 2018 affecting License Delivery, lasting 2h 1m. The incident has been resolved; the full update timeline is below.

Started
Oct 16, 2018, 12:51 AM UTC
Resolved
Oct 16, 2018, 02:53 AM UTC
Duration
2h 1m
Detected by Pingoru
Oct 16, 2018, 12:51 AM UTC

Affected components

License Delivery

Update timeline

  1. investigating Oct 16, 2018, 12:51 AM UTC

    Due to health check failures in us-west-1, all license deliveries in that region have been routed to nearby regions since 00:11 UTC, which increases latency for these requests. We see no signs of failing license deliveries and are investigating the cause of the health check failures.

  2. identified Oct 16, 2018, 01:01 AM UTC

    Update 02:41 UTC - This is an unrelated issue. The AWS service health dashboard states: "05:50 PM PDT We are investigating connectivity issues for some domains in a single Availability Zone in the US-WEST-1 Region."

  3. resolved Oct 16, 2018, 02:53 AM UTC

    All systems are back to normal and licenses are now delivered from all DRMtoday regions.

    Timeline

    00:06 - Backend nodes in us-west-1 lose connectivity to a backend database
    00:06 - Health checks fail and all traffic to region us-west-1 is redirected to nearby regions
    00:12 - DRMtoday's ops team is notified
    00:21 - The offending database node is automatically shut down due to an earlier error. Unfortunately the usual failover/recovery fails.
    00:55 - Database fully recovered
    02:30 - DRMtoday's ops team reenabled deliveries from us-west-1

    All times UTC

    We apologize for the inconvenience and will continue our investigation into the failover behavior.
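
To make the intervals in the timeline easier to read, here is a minimal sketch that computes how many minutes each event occurred after the initial loss of database connectivity. The timestamps come from the resolved update; the shortened event labels and the names `TIMELINE` and `minutes_since_start` are illustrative assumptions, not part of the vendor's report.

```python
from datetime import datetime

# Timestamps copied from the resolved update (all times UTC, Oct 16, 2018).
# The labels are shortened paraphrases chosen for this sketch.
TIMELINE = [
    ("00:06", "database connectivity lost"),
    ("00:06", "health checks fail, traffic redirected"),
    ("00:12", "ops team notified"),
    ("00:21", "database node shut down, failover fails"),
    ("00:55", "database fully recovered"),
    ("02:30", "us-west-1 deliveries reenabled"),
]


def minutes_since_start(timeline):
    """Return (label, whole minutes elapsed since the first entry) per event."""
    day = "2018-10-16 "
    times = [datetime.strptime(day + t, "%Y-%m-%d %H:%M") for t, _ in timeline]
    start = times[0]
    return [
        (label, int((t - start).total_seconds() // 60))
        for t, (_, label) in zip(times, timeline)
    ]


for label, minutes in minutes_since_start(TIMELINE):
    print(f"+{minutes:3d} min  {label}")
```

Under these assumptions, the database recovered 49 minutes after connectivity was lost, and us-west-1 was serving traffic again after 144 minutes. The 2h 1m duration stated above appears to be measured between the first and last status updates rather than from 00:06.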