Font Awesome incident

Downtime for kit.fontawesome.com

Font Awesome experienced a notice-level incident on February 28, 2021 affecting kit.fontawesome.com, lasting 23 minutes. The incident has been resolved; the full update timeline is below.

Started: Feb 28, 2021, 07:41 AM UTC
Resolved: Feb 28, 2021, 08:04 AM UTC
Duration: 23m
Detected by Pingoru: Feb 28, 2021, 07:41 AM UTC

Affected components

kit.fontawesome.com

Update timeline

investigating Feb 28, 2021, 07:41 AM UTC

At approximately 1:05 a.m. Central Time, we began seeing alerts from our monitoring systems indicating downtime for kit.fontawesome.com. Upon investigating, we saw that our Redis database cluster was experiencing a failover event, which led to about 6 minutes of downtime for the service while a replica was being promoted to primary. Systems self-healed around this issue and came back online without operator intervention. We'll be investigating the root cause over the next few days.

resolved Feb 28, 2021, 08:04 AM UTC

This incident has been resolved.

postmortem Mar 03, 2021, 05:19 PM UTC

At approximately 1:05 a.m. Central, the primary Redis database, which is hosted by [Sendgrid.io](http://Sendgrid.io), reported a failover event. This event indicated that the primary Redis instance was no longer serviceable and that the secondary instance was being promoted.

The Kits service is deployed in 7 different geographical regions around the globe. Each region has multiple application servers, and each of those has a replica of the primary Redis database. While the service was designed to handle intermittent disconnection from the primary Redis database, it appears that the promotion after the failover event caused the replicas to go offline.

Once the Redis replica went offline for a particular region, our monitoring and disaster recovery tools began trying to work around the situation. We use Nomad for scheduling jobs, and after health checks started failing, Nomad restarted the job. Without any intervention from our operations team, the service came back online 6 minutes after it went offline. During the downtime, some requests still succeeded because the Cloudflare cache held valid records for some resources.

Our team has identified that Redis failover events are not handled as well as they could be. Ideally, the distributed Redis replicas would continue operating until the new primary Redis database is elected and takes over from the old one.

If you have any questions, please feel free to email us at [email protected].
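The ideal behavior described in the postmortem, where regional replicas keep operating while a new primary is elected, could look roughly like the sketch below. It is an illustration only, not the Kits service's actual code and not a real Redis client: `FakePrimary`, `TolerantReplica`, `PrimaryUnavailable`, and the `kit:abc123` key are all hypothetical names, and the "replication" is a plain in-memory copy.

```python
class PrimaryUnavailable(Exception):
    """Raised by the hypothetical primary while a failover is in progress."""


class FakePrimary:
    """In-memory stand-in for the primary database (assumption, not Redis)."""

    def __init__(self, data):
        self.data = dict(data)
        self.available = True

    def snapshot(self):
        # Simulate the primary being unserviceable during a failover.
        if not self.available:
            raise PrimaryUnavailable("failover in progress")
        return dict(self.data)


class TolerantReplica:
    """Regional replica that keeps serving its last good copy during a failover."""

    def __init__(self, primary):
        self.primary = primary
        self.local = {}
        self.stale = False
        self.sync()

    def sync(self):
        """Try to refresh from the primary; keep the old copy if that fails."""
        try:
            self.local = self.primary.snapshot()
            self.stale = False
        except PrimaryUnavailable:
            # Stay online: mark the data as stale and keep serving it.
            self.stale = True
        return not self.stale

    def get(self, key):
        """Serve reads from the local copy, whether or not it is stale."""
        return self.local.get(key)

    def repoint(self, new_primary):
        """Follow the newly promoted primary once the election finishes."""
        self.primary = new_primary
        return self.sync()


# Small demonstration of a failover from the replica's point of view.
old_primary = FakePrimary({"kit:abc123": "config-v1"})
replica = TolerantReplica(old_primary)

old_primary.available = False          # failover begins
replica.sync()                         # sync fails, replica stays up
during_failover = replica.get("kit:abc123")

promoted = FakePrimary({"kit:abc123": "config-v2"})
replica.repoint(promoted)              # new primary takes over
after_failover = replica.get("kit:abc123")
```

The design choice illustrated here is that a failed sync only flips a `stale` flag rather than taking the replica offline, so reads keep succeeding with slightly old data until the promoted primary is reachable.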