Merinio incident

Interface does not load consistently

Major Resolved

Merinio experienced a major incident on February 19, 2021 affecting Web Application, lasting 3h 10m. The incident has been resolved; the full update timeline is below.

Started
Feb 19, 2021, 08:11 PM UTC
Resolved
Feb 19, 2021, 11:22 PM UTC
Duration
3h 10m
Detected by Pingoru
Feb 19, 2021, 08:11 PM UTC

Affected components

Web Application

Update timeline

  1. investigating Feb 19, 2021, 08:11 PM UTC

    We are currently investigating an issue where Merinio responds intermittently and sometimes does not load.

  2. monitoring Feb 19, 2021, 08:36 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Feb 19, 2021, 11:22 PM UTC

    This incident has been resolved; we will continue to monitor the situation closely.

  4. postmortem Feb 23, 2021, 10:32 PM UTC

    **Background:** On the afternoon of February 19th, 2021, our main API servers began returning significantly elevated error rates with extreme latency. The issue caused widespread unresponsiveness across our infrastructure for about half an hour, preventing the app from loading and substantially delaying page transitions.

    **Root Cause Analysis:** Our initial remedy was to expand the capacity of our production cluster and restart several key services. Investigation of the logs then showed that the request load on our systems had surged by approximately 3000%. This increase drove latency up until requests timed out across the board. We traced the surge to a recent change to the bulk edit tool in Merinio 2.0. The change inadvertently caused the entire user list to refresh on every logged-in device for each user edit; previously, the list refreshed only on a page change following a resource modification.

    **Immediate Response:** To mitigate the immediate impact, we have temporarily disabled real-time updates for changes made through the bulk edit tool. Our team is developing a more efficient way to handle such scenarios.

    **Impact:** This incident significantly disrupted operations for our users and degraded their experience. We deeply regret the inconvenience and interruption it caused.
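
The fan-out described in the root cause analysis can be illustrated with a small sketch. This is not Merinio's implementation: `Broadcaster`, `bulk_edit_per_edit`, `bulk_edit_batched`, and the device and edit counts are all hypothetical, chosen only to show how one notification per edited user multiplies full-list reloads across connected devices, and how a single notification per bulk operation (one possible mitigation, not necessarily the one the team will adopt) would compare.

```python
from dataclasses import dataclass


@dataclass
class Broadcaster:
    """Hypothetical stand-in for a real-time update channel."""

    connected_devices: int
    list_fetches: int = 0  # full user-list reloads triggered on clients

    def notify_user_list_changed(self) -> None:
        # Assumption: every connected device reloads the whole user list
        # each time it receives a notification.
        self.list_fetches += self.connected_devices


def bulk_edit_per_edit(broadcaster: Broadcaster, user_ids: list[int]) -> None:
    """Pattern described in the postmortem: one notification per edited user."""
    for _ in user_ids:
        broadcaster.notify_user_list_changed()


def bulk_edit_batched(broadcaster: Broadcaster, user_ids: list[int]) -> None:
    """Possible alternative: apply all edits, then notify once."""
    if user_ids:
        broadcaster.notify_user_list_changed()


# Hypothetical numbers: 200 logged-in devices, 50 users edited in one bulk action.
edited_users = list(range(50))

per_edit = Broadcaster(connected_devices=200)
bulk_edit_per_edit(per_edit, edited_users)

batched = Broadcaster(connected_devices=200)
bulk_edit_batched(batched, edited_users)

print(per_edit.list_fetches)  # 10000
print(batched.list_fetches)   # 200
```

Under these assumed numbers, the per-edit pattern triggers 50 times as many list reloads as the batched one. The multiplier equals the number of users edited in a single bulk action, so large bulk edits on a busy system can plausibly produce a surge of the magnitude reported above.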