BlueFox Host incident

Service outage

BlueFox Host experienced a critical incident on March 30, 2021 affecting Pterodactyl API and Billing API, lasting 2h 49m. The incident has been resolved; the full update timeline is below.

Started: Mar 30, 2021, 07:04 AM UTC
Resolved: Mar 30, 2021, 09:54 AM UTC
Duration: 2h 49m
Detected by Pingoru: Mar 30, 2021, 07:04 AM UTC

Affected components

Pterodactyl API, Billing API

Update timeline

investigating Mar 30, 2021, 07:38 AM UTC

We are currently investigating this issue.
investigating Mar 30, 2021, 07:39 AM UTC

We are continuing to investigate this issue.
investigating Mar 30, 2021, 07:40 AM UTC

An outage on bh008 has caused the downtime we are experiencing. We are working on discovering the root cause.
identified Mar 30, 2021, 07:47 AM UTC

This appears to be related to the data center the service is in. We are waiting for a response from the client about communications with the location experiencing issues.
identified Mar 30, 2021, 09:01 AM UTC

We are still waiting for the provider to contact us.
investigating Mar 30, 2021, 09:20 AM UTC

We have misidentified the cause of this issue and are investigating.
identified Mar 30, 2021, 09:20 AM UTC

We have identified the cause of the issue.
identified Mar 30, 2021, 09:33 AM UTC

We are continuing to work on a fix for this issue.
identified Mar 30, 2021, 09:45 AM UTC

We have regained access to the affected services.
identified Mar 30, 2021, 09:46 AM UTC

We are working to restore services now.
monitoring Mar 30, 2021, 09:48 AM UTC

A fix has been implemented and we are monitoring the results.
resolved Mar 30, 2021, 09:54 AM UTC

This incident has been resolved.
postmortem Mar 31, 2021, 11:52 PM UTC

#### Leadup

Describe the circumstances that led to this incident

* An engineer made a typo while modifying a cronjob
* The typo led to Teleport failing to complete its certificate rotation on bh008
* There was an issue which required nginx and php7.4-fpm to be restarted
* The service was not available for this to be completed
* Outage was officially recognised

#### Fault

Describe what failed to work as expected

* Gravitational/Teleport was not functioning on bh008
* nginx and php7.4-fpm were not functioning as expected on bh008, returning proxy errors

#### Detection

Describe how the incident was detected

* Application monitoring and synthetics detected the issue and reported it to our engineering and emergency response teams on Opsgenie
* The ERT saw the people assigned to the account and attempted to contact them
* The people on the account were available and took over from the ERT

#### Root causes

Run a 5-whys analysis to understand the true causes of the incident

* Teleport failure
  * A bad cronjob was configured
  * Teleport tried to add a cronjob
  * The Teleport cluster rotated its certificates
  * The Teleport node added another cronjob
  * The crontab was invalid, and therefore the certificates were not renewed
* nginx/php7.4-fpm failure
  * An issue was spotted with a web application
  * A modification was made to php7.4-fpm (discussion on how/why this occurred)
  * The service required restarting
  * The engineer disconnected
  * The Teleport service was unavailable

#### Mitigation and resolution

What steps did you take to resolve this incident?

After taking over from the ERT, I restored the SSH daemon on bh008 to access the server. Once in, I resolved the issue in the crontab and restarted Teleport. Teleport retrieved its new certificates and came back online. After this, I killed the SSH daemon, accessed the host via Teleport and resolved the rest of the issues (including restarting both nginx and php7.4-fpm).

#### Lessons learnt

What went well? What could have gone better? What else did you learn?

* Ensure that configuration changes are staged and tested prior to deployment in a live production environment.
* Double-check your work before saving any configurations or vhosts.
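
As an illustration of the first lesson, a pre-deployment check could catch a malformed crontab entry before it reaches a production host. The sketch below is a hypothetical example, not the tooling used at BlueFox Host, and the postmortem does not say what the actual typo was. It assumes the user-crontab format (five schedule fields followed by a command, with no user column) and numeric fields only, so month and weekday names such as `jan` or `mon` would be flagged even though cron accepts them. The names `validate_crontab` and `check_field` are made up for this example.

```python
import re

# Allowed numeric range for each of the five schedule fields (user-crontab format).
FIELD_RANGES = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day of month", 1, 31),
    ("month", 1, 12),
    ("day of week", 0, 7),
]
SHORTCUTS = {"@reboot", "@yearly", "@annually", "@monthly",
             "@weekly", "@daily", "@midnight", "@hourly"}
ENV_LINE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\s*=")
# One list item: "*", "N" or "N-M", optionally followed by "/step".
ITEM = re.compile(r"^(\*|\d+(?:-\d+)?)(?:/(\d+))?$")


def check_field(value, name, low, high):
    """Return an error string for one schedule field, or None if it looks valid."""
    for item in value.split(","):
        match = ITEM.match(item)
        if not match:
            return f"bad {name} field: {value!r}"
        base, step = match.groups()
        if step is not None and int(step) == 0:
            return f"zero step in {name} field: {value!r}"
        if base != "*":
            bounds = [int(n) for n in base.split("-")]
            if any(n < low or n > high for n in bounds) or bounds != sorted(bounds):
                return f"{name} outside {low}-{high} or reversed range: {value!r}"
    return None


def validate_crontab(text):
    """Return a list of (line_number, message) problems found in crontab text."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Blank lines, comments and environment assignments are not schedule entries.
        if not line or line.startswith("#") or ENV_LINE.match(line):
            continue
        parts = line.split()
        if parts[0].startswith("@"):
            if parts[0] not in SHORTCUTS:
                problems.append((number, f"unknown shortcut: {parts[0]!r}"))
            elif len(parts) < 2:
                problems.append((number, "missing command"))
            continue
        if len(parts) < 6:
            problems.append((number, "expected 5 schedule fields and a command"))
            continue
        for value, (name, low, high) in zip(parts, FIELD_RANGES):
            error = check_field(value, name, low, high)
            if error:
                problems.append((number, error))
    return problems
```

A check like this could run in a staging step or a pre-commit hook and block the change when the returned list is non-empty. It only covers syntax, so it complements rather than replaces testing the change on a staging host.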