BlueFox Host incident
Performance and reliability issues on bh006
BlueFox Host experienced a major incident on April 28, 2021, lasting 4h 54m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Apr 28, 2021, 10:23 PM UTC
We are currently investigating a performance-related issue on bh006.
- investigating Apr 28, 2021, 10:24 PM UTC
This incident has escalated into an outage.
- identified Apr 28, 2021, 10:24 PM UTC
The issue has been identified and a fix is being implemented.
- identified Apr 28, 2021, 10:43 PM UTC
We are currently stopping all services running on this host.
- monitoring Apr 28, 2021, 10:54 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Apr 28, 2021, 11:15 PM UTC
This incident has been resolved.
- postmortem Apr 28, 2021, 11:26 PM UTC
## Leadup

_Describe the circumstances that led to this incident_

* Reports of poor performance
* Local metrics reported an increase in CPU usage over time
* A report from the client identified the issue specifically on bh006

## Fault

_Describe what failed to work as expected_

The [Pterodactyl/Wings v1.4.0](https://github.com/pterodactyl/wings/tree/release/v1.4.0) daemon was consuming all of the system's CPU resources. Restarting the service and reloading systemctl produced the same outcome.

## Detection

_Describe how the incident was detected_

Reported by the client.

## Root causes

_Run a 5-whys analysis to understand the true causes of the incident_

* An update was pushed out to [Pterodactyl/Wings](https://github.com/pterodactyl/wings/tree/release/v1.4.0)
* The update had an issue causing abnormal resource consumption
* The resource consumption issue caused other services to slow down
* The Wings API failed to respond to requests as it could not keep up
* The service was flagged as degraded

## Mitigation and resolution

_What steps did you take to resolve this incident?_

* Stop wings
* Clear the container states cache
* Stop (or kill) all containers
* Delete all containers
* Remove wings from the system
* Fresh install of wings
* Enable wings on boot
* Enable docker on boot
* Restart the system
* Monitor wings until it is confirmed to be functioning as expected

## Lessons learnt

_What went well? What could have gone better? What else did you learn?_

Ensure more in-depth monitoring, and ensure that partial monitoring outages do not affect our response times.
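
As an illustration of the "more in-depth monitoring" lesson, the sketch below shows one way a check could flag sustained CPU saturation by a single process, so that a fault like this one is caught before a client reports it. It is a minimal sketch and not our actual monitoring: the `CpuSample` type, the `sustained_high_cpu` function, and the 90% / 300-second defaults are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CpuSample:
    """One CPU reading for a process (hypothetical shape, not a Wings API)."""
    timestamp: float    # seconds, e.g. Unix time
    cpu_percent: float  # share of total host CPU, 0-100


def sustained_high_cpu(
    samples: Sequence[CpuSample],
    threshold: float = 90.0,      # assumed alert threshold
    min_duration: float = 300.0,  # assumed: 5 minutes of continuous load
) -> bool:
    """Return True if consecutive samples stay at or above `threshold`
    for at least `min_duration` seconds. Samples must be in time order."""
    run_start: Optional[float] = None
    for sample in samples:
        if sample.cpu_percent >= threshold:
            if run_start is None:
                run_start = sample.timestamp  # a high-CPU run begins here
            if sample.timestamp - run_start >= min_duration:
                return True
        else:
            run_start = None  # any dip below the threshold resets the run
    return False


# One reading per minute for five minutes, all saturated: this should alert.
saturated = [CpuSample(float(t), 95.0) for t in range(0, 360, 60)]
# A single short spike should not alert.
spike = [CpuSample(0.0, 20.0), CpuSample(60.0, 99.0), CpuSample(120.0, 30.0)]

print(sustained_high_cpu(saturated))
print(sustained_high_cpu(spike))
```

Requiring a continuous run rather than a single high reading keeps short spikes from paging anyone. The check only works if samples keep arriving, so a gap in the samples should itself be treated as an alert condition, in line with the note above about partial monitoring outages.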