Wisembly incident

App down

Major Resolved

Wisembly experienced a major incident on June 15, 2015, lasting 43m. The incident has been resolved; the full update timeline is below.

Started
Jun 15, 2015, 02:45 PM UTC
Resolved
Jun 15, 2015, 03:29 PM UTC
Duration
43m
Detected by Pingoru
Jun 15, 2015, 02:45 PM UTC

Update timeline

  1. investigating Jun 15, 2015, 02:45 PM UTC

    We currently have a major app outage. We are investigating and doing our best to get things back up again.

  2. identified Jun 15, 2015, 02:49 PM UTC

    The application is up now; we restarted the crippled processes. We're investigating further to understand what happened and will keep you informed. Sorry for the trouble.

  3. resolved Jun 15, 2015, 03:29 PM UTC

    All systems are up again. We'll shortly write a post-mortem explaining what happened, what we did, and how to prevent it from happening again.

  4. postmortem Aug 03, 2018, 05:32 PM UTC

    # 20151506 App down incident

    #### What happened

    Today at 04:46:50pm GMT+1 our application went down for 11 minutes. Our team was alerted a few minutes later by automated probes and error logs. We logged into our servers to look at the problem. We found that our MySQL process was consuming a large amount of RAM, threatening overall system memory and preventing the Redis backup daemon from performing its regular backup snapshots. This affected some Redis-enabled API endpoints, which in turn made our PHP-FPM processes go wild.

    #### What we did

    We restarted the MySQL process in order to clear the buffer cache in RAM and make room for other processes, especially the automated Redis backups. This was not sufficient: we had to restart PHP-FPM too to cool things down a bit. Once that was done, all systems went green again.

    #### How to avoid that in the future

    We reviewed our server configuration with our backend team and hosting provider, and it clearly showed that the RAM allowed for MySQL was too close to the limit and could lead to what happened. We therefore reduced the allowed buffer size by 75%, which still leaves enough room to cache everything we need while leaving enough space for all the other processes on the backend servers. We restarted MySQL again to apply the new configuration, and we will closely monitor its performance in the near future.
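The arithmetic behind the buffer reduction described in the post-mortem can be sketched as follows. This is only an illustration: the function names and every figure (total RAM, original buffer size, memory used by other processes) are hypothetical, since the post-mortem gives only the 75% reduction and no absolute sizes.

```python
def reduced_buffer_mib(current_buffer_mib, reduction_pct):
    """Return the buffer size left after cutting it by reduction_pct percent."""
    return current_buffer_mib * (100 - reduction_pct) // 100


def remaining_headroom_mib(total_ram_mib, mysql_buffer_mib, other_processes_mib):
    """Return the RAM left once the MySQL buffer and other processes are counted."""
    return total_ram_mib - mysql_buffer_mib - other_processes_mib


# Hypothetical figures, chosen only to illustrate the calculation.
TOTAL_RAM_MIB = 16384        # assumed server RAM
OLD_BUFFER_MIB = 12288       # assumed MySQL buffer size before the change
OTHER_PROCESSES_MIB = 3072   # assumed usage by Redis, PHP-FPM and the OS

new_buffer_mib = reduced_buffer_mib(OLD_BUFFER_MIB, 75)
headroom_before = remaining_headroom_mib(TOTAL_RAM_MIB, OLD_BUFFER_MIB, OTHER_PROCESSES_MIB)
headroom_after = remaining_headroom_mib(TOTAL_RAM_MIB, new_buffer_mib, OTHER_PROCESSES_MIB)

print(f"buffer: {OLD_BUFFER_MIB} MiB -> {new_buffer_mib} MiB")
print(f"headroom: {headroom_before} MiB -> {headroom_after} MiB")
```

With these assumed numbers, the free headroom grows from 1024 MiB to 10240 MiB, which is the kind of margin a background snapshot process needs in order to run alongside the database.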