Brillium incident

Authentication Errors Affecting Some Brillium v11 Users

Minor · Resolved

Brillium experienced a minor incident on September 17, 2021, affecting two components: User Administration and Authentication, and Assessment Authoring. The incident lasted 1d 17h and has been resolved; the full update timeline is below.

Started
Sep 17, 2021, 04:37 AM UTC
Resolved
Sep 18, 2021, 10:34 PM UTC
Duration
1d 17h
Detected by Pingoru
Sep 17, 2021, 04:37 AM UTC

Affected components

User Administration and Authentication
Assessment Authoring

Update timeline

  1. investigating Sep 17, 2021, 04:37 AM UTC

    We are investigating a possible recurrence of the previous issue that prevents authors, administrators, and assessment test takers from accessing a small number of accounts.

  2. identified Sep 17, 2021, 04:44 AM UTC

    We are looking at a possible issue that could be caused by the web server software itself. The development team is working on a fix that will be tested. While this work is underway, you may see a maintenance page when accessing Assessment Builder or the Administration portal (if applicable). There is currently no estimated completion time, but we will post more information as soon as we have it.

  3. identified Sep 17, 2021, 04:44 AM UTC

    We are continuing to work on a fix for this issue.

  4. identified Sep 17, 2021, 05:16 AM UTC

    We are continuing to work on a fix for this issue. This is only affecting a small number of users.

  5. monitoring Sep 17, 2021, 07:25 AM UTC

    A fix is being implemented and monitoring will begin.

  6. monitoring Sep 17, 2021, 04:55 PM UTC

    While continuing our investigation of the performance challenges experienced by a small number of our customers, we have narrowed the issue down to activity related to a small number of customers located in the Asia Pacific region. We are working on addressing the impact and will post further updates as we make progress.

  7. monitoring Sep 17, 2021, 07:41 PM UTC

    Mitigation efforts appear to be successful. We will continue monitoring systems. All components are operating at nominal levels.

  8. resolved Sep 18, 2021, 10:34 PM UTC

    All monitoring and tests have completed successfully.

  9. postmortem Oct 22, 2021, 01:11 AM UTC

    ## Findings

    Although extremely limited in effect and scope, this issue was caused by a combination of unexpected and unpredictable factors: assessment configuration, atypical respondent activity and, to an extent, local government policy.

    On September 16, 2021, we began to receive reports from approximately 30 to 40 customers who were impacted by extremely unusual assessment activity patterns. The activity originated from customers delivering assessments in a region of Asia and was related to a large-scale government economic recovery program. The program required qualifying citizens to complete several assessments in order to receive funds. The population of this region is significant, as was the number of qualifying individuals.

    The benefits offered through the program resulted in highly unusual activity, with some individuals retaking a single assessment 100 times or more and completing hundreds (in some cases over a thousand) of assessment attempts per day per person. The configuration of the assessments further increased the internal API call volume. The combination of these factors caused the affected Brillium customers to encounter errors when accessing their accounts and delivering assessments.

    ## Resolution

    Upon analysis and review, Brillium staff and development teams took the following steps to address these issues:

    * Brillium staff consulted and advised customers located in the source regions of the assessment activity on improved assessment configuration, resulting in a reduction of unnecessary activity.
    * Brillium’s development team made significant improvements to the Brillium API, resulting in improved efficiency and increased capacity across all global regions.
    * The team enhanced system metrics and increased monitoring.

    Subsequent activity with similar volumes of attempts took place without incident in the days that followed.

    ## Additional Monitoring

    Although the impact was eliminated within a short time, the development team has monitored systems for an additional 30 days in order to collect and analyze additional assessment and system performance data. The resulting analysis shows that all implemented fixes have adequately increased system performance and eliminated related errors.
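
The per-person attempt volumes described in the findings are the kind of pattern a simple metric can surface. The sketch below is illustrative only and is not Brillium's implementation: the function name `flag_heavy_respondents`, the `(respondent_id, timestamp)` log shape, and the threshold of 100 attempts per day are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical threshold; the postmortem does not state what limit, if any, is used.
MAX_ATTEMPTS_PER_DAY = 100


def flag_heavy_respondents(attempts, limit=MAX_ATTEMPTS_PER_DAY):
    """Return {(respondent_id, date): count} for each respondent-day above `limit`.

    `attempts` is an iterable of (respondent_id, timestamp) pairs, where each
    timestamp is a timezone-aware datetime. Days are bucketed in UTC.
    """
    counts = Counter(
        (respondent_id, ts.astimezone(timezone.utc).date())
        for respondent_id, ts in attempts
    )
    return {key: n for key, n in counts.items() if n > limit}


if __name__ == "__main__":
    # Made-up sample data: one respondent with 150 attempts, one with 3.
    day = datetime(2021, 9, 16, 12, 0, tzinfo=timezone.utc)
    log = [("r-1", day)] * 150 + [("r-2", day)] * 3
    print(flag_heavy_respondents(log))
```

Counting per respondent per day, rather than per assessment, matches the "per day per person" framing in the findings. A real deployment would likely read from attempt logs and feed an alerting system instead of printing.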