Qwilr incident

Server issues affecting the Qwilr app

Qwilr experienced a major incident on June 11, 2019 affecting Qwilr App, lasting 1h 45m. The incident has been resolved; the full update timeline is below.

Started: Jun 11, 2019, 12:46 AM UTC
Resolved: Jun 11, 2019, 02:32 AM UTC
Duration: 1h 45m
Detected by Pingoru: Jun 11, 2019, 12:46 AM UTC

Affected components

Qwilr App

Update timeline

investigating Jun 11, 2019, 12:46 AM UTC

It looks like we're currently having some server issues and we're investigating the cause. You may see errors loading the app or slow load times.
identified Jun 11, 2019, 01:32 AM UTC

The server issue has been resolved. We will continue to monitor the situation.
monitoring Jun 11, 2019, 01:33 AM UTC

The server issue has been resolved. We will continue to monitor the situation.
identified Jun 11, 2019, 02:07 AM UTC

We're currently experiencing server issues, and we're investigating. You may see errors loading the app or slow load times.
identified Jun 11, 2019, 02:32 AM UTC

We are continuing to work on a fix for this issue.
resolved Jun 11, 2019, 02:32 AM UTC

The server issue has been resolved.
postmortem Jun 19, 2019, 05:22 AM UTC

On Tuesday, June 11th at approximately 10:30am AEST, Qwilr experienced serious site reliability issues. Many users experienced failures when using the application and when delivering content to their customers.

Qwilr’s engineering team investigated and observed CPU spikes on some of our webserver instances, but nothing that should have caused the 502 and 504 errors customers reported. Eventually we observed that some of our Node.js Docker Pods (we run in Kubernetes) were hitting 100% CPU, and further investigation showed that these processes were taking up to 30 minutes to process a single request.

The cause turned out to be a very large payload sent to our API. Part of the problem was code that was designed to run fast for small payloads but did not handle a payload of this size: it filled the memory allocated to the Pod and drove the CPU to 100%. In addition, as a consequence of recently moving infrastructure from Rackspace to AWS, our Kubernetes Pods lacked readiness checks, which would have ensured that traffic was not routed to Pods that were unresponsive. As a result, requests to these Pods timed out and returned 502 or 504 errors.

By 6pm AEST on the 11th, we had deployed a code fix that resolves the root cause, so that these Pods can process such a large payload in approximately 1/10th of the time. We also set up a readiness check to make our system more robust. We are also working with our API customers to find a sensible limit on payload sizes. As a result of this work, we are confident that our system is more stable and resilient for the future.
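
For readers wondering what a payload size limit of the kind mentioned above might look like, here is a minimal TypeScript sketch. It is not Qwilr's actual fix, which the postmortem does not show. The names `MAX_PAYLOAD_BYTES`, `checkDeclaredLength`, and `ByteBudget` are hypothetical, and the 1 MB limit is an assumed placeholder, since no real limit is stated. The idea is to reject an oversized request before any expensive processing starts, first from the declared `Content-Length` and then by counting bytes as the body streams in.

```typescript
// Assumed placeholder limit; the postmortem does not state a real value.
const MAX_PAYLOAD_BYTES = 1_000_000;

type PayloadCheck =
  | { ok: true }
  | { ok: false; status: 400 | 413; reason: string };

// Reject early from the declared Content-Length header, before parsing the body.
function checkDeclaredLength(
  contentLength: string | undefined,
  limit: number = MAX_PAYLOAD_BYTES,
): PayloadCheck {
  // No header (for example a chunked body): defer to the streaming check below.
  if (contentLength === undefined) {
    return { ok: true };
  }
  const declared = Number(contentLength);
  if (!Number.isInteger(declared) || declared < 0) {
    return { ok: false, status: 400, reason: "invalid Content-Length" };
  }
  if (declared > limit) {
    return { ok: false, status: 413, reason: "payload too large" };
  }
  return { ok: true };
}

// Running byte counter for bodies whose size is not known up front.
class ByteBudget {
  private seen = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns false once the running total exceeds the limit.
  accept(chunkLength: number): boolean {
    this.seen += chunkLength;
    return this.seen <= this.limit;
  }
}
```

In a real server, a failed check would typically translate into an HTTP 413 (or 400) response and the connection would stop reading the body. A limit like this complements a readiness check rather than replacing it: the limit bounds the work a single request can demand, while the readiness check keeps traffic away from a Pod that is already overloaded.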