Factorial HR experienced a major incident on September 9, 2020, affecting the API & backend and the Factorial website. The incident has been resolved; the full update timeline is below.
Affected components: API & backend, Factorial website
Update timeline
- resolved Sep 09, 2020, 02:15 PM UTC
Major outage of all our services except the blog from 14:39 to 15:43.
- postmortem Sep 09, 2020, 02:16 PM UTC
# What happened?
At 14:39, new content for our public pages was deployed, causing our cache to hit its maximum capacity limit. This triggered a fallback strategy: our backend started requesting the content for our public pages from a third-party service. That service quickly became overwhelmed with requests and started applying an exponential backoff strategy, forcing our backend services to wait a long time for each response and making our API unresponsive.

# How did we solve it?
Increasing the maximum capacity limit of our cache fixed the issue.

# How are we going to make sure it does not happen again?
We are going to review our cache strategy so that our whole infrastructure no longer depends on the cache in order to function properly.
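To illustrate the kind of cache strategy review mentioned above, here is a minimal sketch of a cache-miss fallback that fails fast instead of blocking on a slow provider. It is not Factorial's actual implementation: `BoundedCache`, `CircuitBreaker`, `get_page`, the placeholder string, and all thresholds are hypothetical names and values chosen for illustration, and the `fetch` callable is assumed to enforce its own short timeout.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class BoundedCache:
    """Tiny LRU cache with a hard entry limit (a stand-in for a real cache)."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._data: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry


class CircuitBreaker:
    """Stops calling a failing dependency for `cooldown` seconds."""

    def __init__(self, max_failures: int, cooldown: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and let calls through again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def get_page(key: str, cache: BoundedCache, fetch: Callable[[str], str],
             breaker: CircuitBreaker,
             placeholder: str = "<!-- content unavailable -->") -> str:
    """Serve from cache; on a miss, call the provider only if the breaker allows it."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    if not breaker.allow():
        return placeholder  # fail fast instead of waiting on the provider
    try:
        content = fetch(key)  # assumed to raise quickly on timeout
    except Exception:
        breaker.record_failure()
        return placeholder
    breaker.record_success()
    cache.put(key, content)
    return content
```

With a design along these lines, a full cache or an overwhelmed provider degrades only the public-page content (a placeholder is served) rather than tying up backend workers while they wait for responses.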