Digital Pigeon incident

Delays generating previews...

Minor, Resolved

Digital Pigeon experienced a minor incident on October 26, 2018, affecting the Oceania, South East Asia, USA West, and Europe Transcoding Servers. It lasted 1h 8m. The incident has been resolved; the full update timeline is below.

Started: Oct 26, 2018, 10:26 PM UTC
Resolved: Oct 26, 2018, 11:34 PM UTC
Duration: 1h 8m
Detected by Pingoru: Oct 26, 2018, 10:26 PM UTC

Affected components

Oceania Transcoding Servers
South East Asia Transcoding Servers
USA West Transcoding Servers
Europe Transcoding Servers

Update timeline

  1. investigating Oct 26, 2018, 10:26 PM UTC

    We are investigating reports that it's taking longer than usual for previews to appear.

  2. monitoring Oct 26, 2018, 10:44 PM UTC

    It appears that the process that manages the workers in the transcoding cluster had an issue that caused it to stop raising alarms as the 'time on queue' grew. As a result, it stopped adding workers to the cluster as needed, which slowed down preview generation. We are continuing to investigate the issue with the management cluster to prevent a recurrence, but for now a reboot of the cluster seems to have resolved it.

  3. resolved Oct 26, 2018, 11:34 PM UTC

    This incident has been resolved. Unfortunately, our automated monitoring systems didn't alert us to this issue, which meant it took longer than usual for an ops engineer to intervene. The automated monitoring system was configured to raise alarms when it detected long queue times, but it was not configured to raise an alarm for missing queue stats (which indicate that the process manager has stopped reporting). We've extended our monitoring alarms to alert us if this situation occurs again. Our apologies for the inconvenience.
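
The gap described in the final update, an alarm on long queue times but none on missing queue stats, can be illustrated with a minimal sketch. This is not Digital Pigeon's actual monitoring code: the names `QueueSample` and `evaluate_alarm` and both thresholds are hypothetical, chosen only to show how a staleness check sits alongside a threshold check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueueSample:
    """One reported queue statistic (hypothetical shape)."""
    timestamp: float      # Unix time at which the stat was reported
    time_on_queue: float  # seconds the oldest job has been waiting


def evaluate_alarm(
    latest: Optional[QueueSample],
    now: float,
    max_time_on_queue: float = 120.0,  # assumed threshold, in seconds
    max_sample_age: float = 300.0,     # assumed staleness limit, in seconds
) -> str:
    """Return "MISSING_DATA", "LONG_QUEUE", or "OK"."""
    # Staleness is checked first: if the reporter has gone silent, the last
    # known queue time says nothing about the current state of the queue.
    if latest is None or now - latest.timestamp > max_sample_age:
        return "MISSING_DATA"
    # The threshold check is only meaningful on a fresh sample.
    if latest.time_on_queue > max_time_on_queue:
        return "LONG_QUEUE"
    return "OK"
```

With only the second check in place, a reporter that stops sending stats never trips the alarm, because no fresh sample ever exceeds the threshold. Treating silence as its own alarm state closes that gap.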