Noko incident

Freckle is down and not loading

Noko experienced a notice incident on October 28, 2015, lasting approximately 5 hours and 20 minutes (5:54am UTC to 11:14am UTC). The incident has been resolved; the full update timeline is below.

Started: Oct 28, 2015, 01:02 PM UTC
Resolved: Oct 28, 2015, 01:02 PM UTC
Duration: —
Detected by Pingoru: Oct 28, 2015, 01:02 PM UTC

Update timeline

resolved Oct 28, 2015, 01:02 PM UTC

Freckle was down and not loading from 5:54am UTC to 11:14am UTC. First of all, we're extremely sorry that this happened. This sucks, and it definitely should not happen. Freckle is back to working order and we've identified and resolved the issue. We will provide a detailed post-mortem analysis.

postmortem Jul 31, 2018, 05:04 PM UTC

Today, October 28, 2015, **Freckle experienced an outage of approximately 5 hours and 20 minutes**, between 5:54am UTC and 11:14am UTC.

First of all, **this sucks and we’re extremely sorry that it happened.** Freckle is supposed to save you time, not make you bite your fingernails in anxiety and frantically refresh to see if it’s back up again.

![chart](http://cl.ly/image/3y2s3o1Y2d1B/Screen%20Shot%202015-10-28%20at%209.06.35%20AM.png)

Here’s the sequence of events as they happened:

1. **There was a scheduled emergency reboot for one of our servers** from our hosting provider Rackspace, planned for between 3AM UTC and 5AM UTC. Rackspace had informed us about this a few days earlier, and we set up a scheduled maintenance notice on our status page (this reboot was to apply a critical security patch to Rackspace's hosting infrastructure). **Part of the Freckle Ops team was actively monitoring the servers to make sure the reboot was performed successfully and Freckle continued to work as expected.** _The server that was scheduled to be rebooted did so at approximately 3:30AM UTC without a problem._
2. **At approximately 5:54 AM UTC, a different server was rebooted as part of a different scheduled infrastructure maintenance.** Rackspace did not notify us that this server would be rebooted at all, and it was rebooted outside of the previously scheduled maintenance window.
3. Unfortunately, **this reboot did not properly configure the networking interfaces for the server**, causing the server to be up and running but unreachable by our other servers. **This in turn caused Freckle to be unavailable and show an error page.**
4. Our monitoring alerted us immediately. However, our ops team was not actively monitoring the server because they were unaware of the additional scheduled reboot.
5. _In other words, our on-call staff was actively prepared for the given maintenance schedule, but a different server was rebooted outside of that window (two hours later, when the ops team was asleep)._ **However, the main issue is that we did not immediately receive the alerts that the second server was down. This is 100% our fault, and we’re going to make sure that our ops team is notified of any major server outages via multiple methods.**
6. **As soon as alerts were received by our ops team, we immediately contacted Rackspace's support to investigate and resolve the outage.** They were able to restore network connectivity for the server, and Freckle and the Freckle apps are now fully up and running again.

**Rackspace has committed to investigate why we weren't informed of the reboot.** They have confirmed that we should have been informed (_"…the fact of the matter is that you were not made aware that this server was going to be impacted by the reboots today in the ORD datacenter"_). We will update this post-mortem when we hear back from them.

**Again, this sucks, and we’re extremely sorry about the downtime.** We're proud of being able to deliver high availability of our hosted software to our customers, with most months at 100% or 99.99% availability, but **we completely failed to do so today.** We’re actively addressing how our ops team is notified of server outages, to ensure we will respond as quickly as possible.

Thomas Fuchs
Co-Founder and CTO, Freckle Time Tracking
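
To illustrate what notifying an ops team "via multiple methods" can look like, here is a minimal Python sketch. It is not Freckle's actual alerting setup, which the post-mortem does not describe. The `notify_all` helper, the `AlertResult` type, and the three channel functions are all hypothetical stand-ins. The idea shown is a simple fan-out: the alert goes to every channel, and a failure in one channel does not stop delivery through the others.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A channel is any callable that tries to deliver a message and returns
# True on success. Real channels (email, SMS, phone) are not implemented
# here; the functions below are hypothetical stand-ins.
Channel = Callable[[str], bool]


@dataclass
class AlertResult:
    delivered: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)

    @property
    def reached_someone(self) -> bool:
        # True if at least one channel accepted the alert.
        return bool(self.delivered)


def notify_all(message: str, channels: Dict[str, Channel]) -> AlertResult:
    """Send the alert through every channel, isolating failures."""
    result = AlertResult()
    for name, send in channels.items():
        try:
            ok = send(message)
        except Exception:
            # A broken channel must not prevent the remaining ones from firing.
            ok = False
        (result.delivered if ok else result.failed).append(name)
    return result


# Hypothetical channels for demonstration only.
def email(msg: str) -> bool:
    return True


def sms(msg: str) -> bool:
    raise ConnectionError("gateway unreachable")


def phone_call(msg: str) -> bool:
    return True


result = notify_all(
    "app server unreachable",
    {"email": email, "sms": sms, "phone": phone_call},
)
print(result.delivered, result.failed)
```

In this sketch the simulated SMS gateway fails, yet the alert still goes out by email and phone. A production setup would typically add acknowledgement tracking and escalation on top of this, which is outside the scope of the example.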