Test IO incident

Network problems and CORS issues

Test IO experienced a notice incident on February 24, 2018, lasting —. The incident has been resolved; the full update timeline is below.

Started: Feb 24, 2018, 01:46 AM UTC
Resolved: Feb 24, 2018, 01:46 AM UTC
Duration: —
Detected by Pingoru: Feb 24, 2018, 01:46 AM UTC

Update timeline

resolved Feb 24, 2018, 01:46 AM UTC

At 2pm (UTC) on February 21st, 2018, network problems at our provider prevented test IO’s servers from properly communicating with each other. After we bypassed our load balancer, issues with the CORS configuration prevented some users from using our platform. Everything has been running smoothly again since 1am (UTC+1) on February 22nd, 2018.
postmortem Aug 02, 2018, 04:37 PM UTC

#### What happened?

Beginning at 2pm (UTC) on February 21st, 2018, network problems at our provider prevented test IO’s servers from properly communicating with each other. After ascertaining why the servers were not reliably reachable, we took action to rectify the issue by simplifying the connection between our servers. At 6pm (UTC+1) we bypassed our load balancer by pointing the domain of our application platform directly to a single, powerful application server to ensure basic functionality of our platform. However, this left the platform without authorization for our content delivery network, because the load balancer ensures delivery of our CORS headers.

Caching of our content resulted in inconsistent outages, where the website worked for some users but not for others. Ultimately, these inconsistencies caused further frustration. Because the problem appeared fixed to those of us with the correct content already cached, we declared the problem fixed. This, as well as our status page, was misleading to those for whom our platform was clearly not working. Our status page is only set up to monitor the status of our servers and the ability to connect to them. Therefore it correctly displayed the 53 minutes during which our servers were truly unreachable, but not the continued intervals during which some of you were not able to access the information you wanted.

#### How we fixed the problem

We realized that some users were entirely blocked from our content delivery network and were not able to access the JavaScript and stylesheets necessary for site functionality. To fix this, at around 1am (UTC+1) on February 22nd, 2018, we centralized content sourcing directly from our main server through users’ domains, avoiding the cross-domain authorization issue. Since last night, our platform has remained usable. To return content delivery speeds and functionality to normal, we re-enabled the load balancer and CDN at 12pm (UTC+1) on Thursday Feb 22nd, 2018. Service has since returned to the availability and reliability we know you rely on.

#### Apology

We strive to provide our customers, testers, and team leaders with a reliable QA testing platform, and we know that the issues with our website yesterday disrupted our customers’ business and prevented them from accomplishing their normal work. For that we sincerely apologize. To learn and grow from this incident, we are taking this opportunity to review our current processes and improve the reliability of our service. We are currently working with our hosting provider to determine the root cause of the connectivity breakdown between our servers, and we are investigating how we might set up our systems to fail over to another network in the event of a recurrence. We’re also looking at alternative hosting options.
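
#### Illustration: checking assets for CORS headers

The postmortem notes that the status page only monitored whether the servers were reachable, so it missed the period in which assets were served without the CORS headers normally added by the load balancer. As a rough illustration of what an additional check could look like, the Python sketch below tests whether a response's `Access-Control-Allow-Origin` header would permit a given origin. It is not test IO's actual monitoring: the function names `cors_allows` and `check_asset` and any URLs passed to them are hypothetical, and the logic is deliberately simplified (it ignores credentials and preflight requests).

```python
from typing import Mapping, Optional
import urllib.request


def cors_allows(origin: str, headers: Mapping[str, str]) -> bool:
    """Return True if the response headers permit `origin` to use the resource.

    Simplified assumption: only Access-Control-Allow-Origin is inspected.
    """
    # Header names are case-insensitive, so normalise them before the lookup.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    allowed: Optional[str] = normalised.get("access-control-allow-origin")
    if allowed is None:
        # Header missing entirely, as when the layer that adds it is bypassed.
        return False
    return allowed == "*" or allowed == origin


def check_asset(asset_url: str, origin: str, timeout: float = 10.0) -> bool:
    """Fetch a (hypothetical) asset URL with an Origin header and test the reply."""
    request = urllib.request.Request(asset_url, headers={"Origin": origin})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return cors_allows(origin, dict(response.headers.items()))
```

A monitor built along these lines would call `check_asset` for a representative script or stylesheet on a schedule and raise an alert when it returns `False`, even while the servers themselves remain reachable.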