Perimeter 81 incident
Sporadic errors with the Web Management Console
Perimeter 81 experienced a major incident on January 16, 2022, lasting 1h 20m. The incident has been resolved; the full update timeline is below.
Update timeline
- identified Jan 16, 2022, 11:59 AM UTC
We are currently experiencing an issue with our web management console: some admin users may get a "Server Timeout" message when navigating the platform. Our teams are investigating the issue and working to deploy a fix as soon as possible.
- identified Jan 16, 2022, 12:10 PM UTC
The team was able to identify the root cause of the issue that affects some of the web sessions in the management console and we are working on a fix to restore full functionality.
- identified Jan 16, 2022, 12:33 PM UTC
Our team is still working on fixing the root cause of the issue in order to restore full functionality.
- identified Jan 16, 2022, 12:55 PM UTC
The team has identified the affected component, and we are now re-deploying it to restore full functionality of the platform.
- identified Jan 16, 2022, 12:59 PM UTC
Our teams were able to apply a temporary fix while re-deploying the affected services. Most users should be back to full functionality.
- resolved Jan 16, 2022, 01:20 PM UTC
The issue was fully resolved by our team.
- postmortem Jan 20, 2022, 01:33 PM UTC
**Root Cause Analysis**

The sporadic errors in our Management UI occurred because some UI services could not maintain their connection to an internal Database after a leadership change, which was triggered by a failover in our Databases' redundant cluster. The failover resulted from increased memory consumption, which drove a high rate of swap utilization and put stress on the host system's resources.

We’ve identified that some UI services ignored the Database leadership change and the status change of the replica Database. They therefore continued running queries against the old Database, and those queries failed once they timed out. The root cause of this issue was an outdated driver that did not detect the failover and the Database leadership swap.

**Corrective Actions**

Immediate - The issue was resolved by having Database-01 assume full leadership after the failover of Database-02.

Short-term - Updating the outdated driver to the correct version.

Long-term - We will take advantage of the Database Cluster Auto-Scaling feature, which can identify high memory utilization and automatically upgrade the cluster resources accordingly.
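
To illustrate the failure mode described in the root cause analysis, the following is a minimal, self-contained Python simulation. It is not the actual driver or service code. The classes `Node`, `Cluster`, `StaleClient`, and `FailoverAwareClient`, and the `NotPrimaryError` exception, are hypothetical names introduced only for this sketch. It also assumes that a query sent to a demoted node fails immediately, whereas in the incident such queries failed only after timing out.

```python
from dataclasses import dataclass


class NotPrimaryError(Exception):
    """Stands in for a query failing on a node that is no longer the leader."""


@dataclass
class Node:
    name: str
    is_primary: bool = False

    def query(self, sql: str) -> str:
        # Assumption: a demoted node rejects the query at once
        # (in the incident, such queries timed out instead).
        if not self.is_primary:
            raise NotPrimaryError(f"{self.name} is not the primary")
        return f"{self.name} ran: {sql}"


class Cluster:
    """Hypothetical two-node redundant cluster with a single leader."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def current_primary(self) -> Node:
        return next(n for n in self.nodes if n.is_primary)

    def fail_over(self, new_primary_name: str) -> None:
        # Promote the named node and demote every other node.
        for node in self.nodes:
            node.is_primary = node.name == new_primary_name


class StaleClient:
    """Resolves the leader once and never re-checks (outdated-driver behaviour)."""

    def __init__(self, cluster: Cluster):
        self.primary = cluster.current_primary()

    def query(self, sql: str) -> str:
        return self.primary.query(sql)


class FailoverAwareClient:
    """Re-resolves the leader when a query fails, then retries."""

    def __init__(self, cluster: Cluster, max_attempts: int = 2):
        self.cluster = cluster
        self.max_attempts = max_attempts
        self.primary = cluster.current_primary()

    def query(self, sql: str) -> str:
        for attempt in range(self.max_attempts):
            try:
                return self.primary.query(sql)
            except NotPrimaryError:
                if attempt == self.max_attempts - 1:
                    raise
                # Leadership changed: look up the new leader before retrying.
                self.primary = self.cluster.current_primary()
        raise RuntimeError("max_attempts must be at least 1")
```

In this sketch, after `cluster.fail_over("Database-01")` a `StaleClient` keeps querying Database-02 and every call raises `NotPrimaryError`, while a `FailoverAwareClient` re-resolves the leader and succeeds on its second attempt. A real driver would typically learn of the topology change through the database's own discovery mechanism rather than a shared in-process object.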