inSided experienced a critical incident on September 13, 2023 affecting Status of our EMEA Community Infrastructure and Status of our US Community Infrastructure, lasting 1h 22m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Sep 13, 2023, 02:44 PM UTC
We are currently investigating this issue.
- identified Sep 13, 2023, 02:47 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Sep 13, 2023, 03:11 PM UTC
[EU+US] Communities should be coming back online. We will be monitoring the status.
- resolved Sep 13, 2023, 04:06 PM UTC
This incident has been resolved.
- postmortem Sep 26, 2023, 10:20 AM UTC
### September 13th - Platform Outage (all regions) RCA

**Issue**

Every user accessing the platform's community pages was greeted with an error message instead of the requested community page. As a result, no community functions could be performed and no content could be viewed, which amounted to a complete outage across both regions.

**Cause**

A code change to the platform, made to address another issue, was deployed and caused this outage. In short, a code comment was not parsed correctly by the production server, which created a fatal front-end error for every customer and community across the EU and US regions.

Why did our automated tests not pick this up?

* End-to-end tests are normally run against a test instance, which did not contain the problematic change.
* We identified insufficient automated tests and lints against these particular front-end templates (tests that would have picked this issue up). As a result, the issue slipped through and was deployed to the production servers.

**Resolution**

The issue was swiftly mitigated by reverting the code deploy that caused it. However, communities were unavailable for 27 minutes, during which the error message was displayed to all users.

**Prevention steps**

* Add a linter for template files to catch these rendering issues in future, and ensure that we are receiving a status 200 (OK) code for page loads.
* Additional health checks for the most critical staging resources.
* Standard 5-10 checks on the most critical pages to ensure that the correct response codes are received before giving the option to approve for production.
* Add these checks to the deployment pipeline that is currently being worked on.
* Reinforce checking staging consistently before approving deployment to production.
* More stringent visual checks, e.g. on staging, before the button for production deployment is pressed.
* More stringent logging checks, e.g. on staging, before the button for production deployment is pressed.

**Timeline**

* 16:42 this was escalated to a critical incident internally and a status page update was posted
* 16:43 the incident was recognised by engineering and the process for initiating a rollback was started immediately
* 16:52 the rollback was triggered once the specific deployment causing the issue was located
* 17:07 EU rollback completed and communities were back online
* 17:09 US rollback completed and communities were back online
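
The prevention steps describe checking that the most critical pages return a status 200 before the option to approve a production deployment is offered. A minimal sketch of such a gate is shown below. It is an illustration only, not the pipeline's actual implementation: the function names (`http_status`, `failing_pages`, `approve_for_production`) and any staging URLs passed to them are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET to `url`, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise, but still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout: no valid response.
        return 0


def failing_pages(
    urls: Iterable[str],
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, int]:
    """Map each URL that did not return 200 to the status it returned."""
    failures = {}
    for url in urls:
        status = fetch_status(url)
        if status != 200:
            failures[url] = status
    return failures


def approve_for_production(
    urls: Iterable[str],
    fetch_status: Callable[[str], int] = http_status,
) -> bool:
    """Allow approval only if every critical page returned 200."""
    return not failing_pages(urls, fetch_status)
```

Passing the status fetcher as a parameter keeps the gate testable without network access: a stub such as a dictionary's `get` method can stand in for `http_status`. A status check alone would not catch an error page that is served with a 200, which is why the steps above pair it with template linting and visual checks.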