CyberFOX incident

Unscheduled Database Maintenance

Critical Resolved

CyberFOX experienced a critical incident on April 22, 2022. The incident has been resolved; the full update timeline is below.

Started
Apr 22, 2022, 02:45 PM UTC
Resolved
Apr 22, 2022, 02:45 PM UTC
Duration
Detected by Pingoru
Apr 22, 2022, 02:45 PM UTC

Update timeline

  1. resolved Apr 26, 2022, 03:33 PM UTC

    We have resolved the issue.

  2. postmortem Apr 26, 2022, 03:53 PM UTC

    This outage was directly related to the previous outage. As with many complex issues, it is never just one thing; typically several things happen simultaneously and converge into this type of event. In the first outage, many of our services were up and operational until later that night, when we manually took everything all the way down so that we could fully resolve the issue. Although the original outage lasted longer, our services were not totally down, because they are segmented across multiple APIs for capacity, security, and throughput.

    The original outage manifested itself as corrupted indexes in the primary database. Our Dev and DevOps teams identified and fixed several things that they believed could have contributed to this issue, and identified several things that we should do better to report on and resolve this type of issue more quickly in the future. This outage was much shorter (44 minutes), in part because of the lessons learned and the planning done the previous week, but it also revealed that we had not fully resolved the issue.

    Ultimately, there was an additional factor, which was the root cause: database capacity. Although our services are dynamically scalable and we have caching, pooling, database followers, segmented APIs, and other mitigations built into our infrastructure, the bottom line is that our primary database needed additional resources. We still have some backend tweaks and adjustments to implement over the coming months. These will further enhance performance, capacity, and throughput, and will ensure we don’t have a repeat of this type of outage.