Brillium incident

502 Gateway Error Reports

Brillium experienced a critical incident on June 13, 2022, affecting API, User Administration and Authentication, and 1 more component, lasting 12h 13m. The incident has been resolved; the full update timeline is below.

Started: Jun 13, 2022, 02:39 PM UTC
Resolved: Jun 14, 2022, 02:52 AM UTC
Duration: 12h 13m
Detected by Pingoru: Jun 13, 2022, 02:39 PM UTC

Affected components

API, User Administration and Authentication, Assessment Authoring, Partner Central Custom Administration

Update timeline

investigating Jun 13, 2022, 02:39 PM UTC

We are currently investigating 502 Gateway errors reported by a portion of our customers.
investigating Jun 13, 2022, 02:43 PM UTC

The system operations team is investigating whether an operating-system-level error in some of the AWS cloud systems may be affecting services. Currently, system monitoring of Brillium services does not indicate anything unusual. Although this may lie outside of our direct control, the team is investigating any opportunity to mitigate the effect from our side. We continue to investigate.
investigating Jun 13, 2022, 03:03 PM UTC

We are continuing to investigate the issue.
investigating Jun 13, 2022, 03:27 PM UTC

The systems operations team is provisioning a possible workaround to the issue, while they continue to investigate. An additional status update will be posted in approximately 15 minutes.
investigating Jun 13, 2022, 04:02 PM UTC

We have implemented a workaround to these issues. The systems are available and we continue to investigate and gather information surrounding the root cause.
investigating Jun 13, 2022, 04:17 PM UTC

We are continuing to investigate this issue.
investigating Jun 13, 2022, 04:19 PM UTC

We are receiving intermittent reports that some users are encountering 503 errors.
investigating Jun 13, 2022, 05:35 PM UTC

The system operations team is currently consulting with AWS engineers to determine the root cause of the issue.
investigating Jun 13, 2022, 06:19 PM UTC

Investigations continue. Currently, there appears to be an issue with the network routing connecting our systems that is causing sporadic failures. Reports have begun to surface that other Amazon systems are experiencing downtime as well, although we do not know the extent of those issues or their relationship to our specific issue. As these systems are outside of our control, our attempted workaround(s) did not sufficiently address the problem. We are currently assisting engineers with further diagnosis in any way that we can, to help resolve the issue as quickly as possible.
investigating Jun 13, 2022, 06:38 PM UTC

Brillium Systems are currently back in service. Our internal tests show the external networking is presently stable. We will continue to monitor.
monitoring Jun 13, 2022, 07:18 PM UTC

We are continuing to monitor activity. Initial results and testing are positive.
resolved Jun 14, 2022, 02:52 AM UTC

Monitoring shows that the issues have been addressed. A full report will be shared via the incident Post-Mortem.
postmortem Jun 19, 2022, 10:08 PM UTC

### Background

On June 13 there was an outage event (very similar to the one on June 8), occurring between approximately 1300 UTC and 1800 UTC. This event only affected a portion of our customers. Review of our data indicated that specific Amazon AWS cloud service communications between our server systems and authentication resources were failing sporadically during this period, and at times appeared to fail altogether. Our monitoring and system information shows that our systems repeatedly tried to communicate with external resources without success. All systems became operational later in the day.

Unlike the previous event on June 8, this one was publicly reported by a few outlets and users on social media. Amazon’s own services appear to have been impacted by the event.

### Steps taken:

* We communicated our status and findings, as we learned them, to all customers via our status page.
* We provided additional details to customers through direct support communications.
* We contacted Amazon AWS Support for additional information and confirmation of our findings.
* We continued to closely monitor systems for a period after the event, to ensure ongoing stability.

### Public Reports

Public reports of the Amazon outage can be found via the links below and these were shared with some customers:

* https://www.reuters.com/technology/amazon-down-thousands-users-downdetector-2022-06-13/
* https://knowtechie.com/amazon-is-down-for-a-ton-of-people-right-now/
* https://www.theverge.com/2022/6/13/23166246/amazon-down-error-message-outage

### Mitigation

As this event was external to our systems and outside of our control, no direct actions could have predicted or prevented the issues. External network or Internet issues can and will affect access to our systems; however, these types of events are generally rare and often quickly resolved.