Digital Pigeon experienced a minor incident on October 26, 2018 affecting Oceania Transcoding Servers, South East Asia Transcoding Servers, and 1 more component, lasting 1h 8m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Oct 26, 2018, 10:26 PM UTC
We are investigating reports that it's taking longer than usual for previews to appear.
- monitoring Oct 26, 2018, 10:44 PM UTC
It appears that there was an issue with the process that manages the workers in the transcoding cluster, which caused it to stop raising alarms as the 'time on queue' grew. This in turn stopped it from adding workers to the cluster as needed, which slowed down the preview process. We are continuing to investigate the issue with the management cluster to prevent this from happening again, but for now a reboot of the cluster seems to have resolved the issue.
- resolved Oct 26, 2018, 11:34 PM UTC
This incident has been resolved. Unfortunately, our automated monitoring systems didn't alert us to this issue, which meant it took longer than usual for an ops engineer to intervene. The automated monitoring system was configured to raise alarms when it detected long queue times, but it was not configured to raise an alarm for missing queue stats (indicating that the process manager had stopped reporting). We've extended our monitoring alarms to alert us if this situation occurs again. Our apologies for the inconvenience.
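
For readers curious about the failure mode described in the timeline, the following is a minimal sketch of an alarm check that treats missing or stale queue stats as an alarm condition in its own right, rather than only alarming on long queue times. It is an illustration only and not Digital Pigeon's actual monitoring code: the names `QueueSample` and `evaluate_queue_alarm`, the state strings, and both threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QueueSample:
    """One reported queue statistic (hypothetical shape)."""
    timestamp: float      # seconds since epoch when the stat was reported
    time_on_queue: float  # seconds the oldest job has been waiting


def evaluate_queue_alarm(
    samples: Sequence[QueueSample],
    now: float,
    max_time_on_queue: float = 120.0,  # assumed threshold, in seconds
    max_staleness: float = 300.0,      # assumed threshold, in seconds
) -> str:
    """Return an alarm state for the transcoding queue.

    Missing or stale stats are checked first, so a reporter that has
    gone silent is never mistaken for a healthy, empty queue.
    """
    if not samples:
        return "ALARM_MISSING_STATS"

    latest = max(samples, key=lambda s: s.timestamp)

    # The reporter stopped sending stats: alarm even though no
    # long queue time was ever observed.
    if now - latest.timestamp > max_staleness:
        return "ALARM_MISSING_STATS"

    if latest.time_on_queue > max_time_on_queue:
        return "ALARM_LONG_QUEUE"

    return "OK"
```

Checking for staleness before checking the queue time is the design choice that matters here: a check that only looks at reported values has nothing to evaluate once reporting stops, which matches the gap described in the resolved update.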