Perimeter 81 incident

Some users are experiencing an issue using the Perimeter 81 service

Critical · Resolved

Perimeter 81 experienced a critical incident on March 14, 2022, lasting 22h 47m. The incident has been resolved; the full update timeline is below.

Started
Mar 14, 2022, 09:58 AM UTC
Resolved
Mar 15, 2022, 08:46 AM UTC
Duration
22h 47m
Detected by Pingoru
Mar 14, 2022, 09:58 AM UTC

Update timeline

  1. investigating Mar 14, 2022, 09:58 AM UTC

    Our team detected an issue with one of our session authentication services. We are currently working to identify the root cause.

  2. identified Mar 14, 2022, 10:15 AM UTC

    The team has identified the issue, and we are working on a fix to restore full platform functionality.

  3. identified Mar 14, 2022, 10:26 AM UTC

    We are continuing to work on a fix for this issue.

  4. identified Mar 14, 2022, 10:53 AM UTC

    The team is still working on a fix for the issue to restore full functionality.

  5. identified Mar 14, 2022, 11:21 AM UTC

    The team is currently focused on restoring one of our core features, which failed due to connectivity issues. We understand the impact on our customers and are working relentlessly to address the issue or apply a workaround as soon as possible.

  6. identified Mar 14, 2022, 11:45 AM UTC

    Our entire team is working to resolve the issue. We understand the impact of the outage on our customers and are treating it with the utmost urgency.

  7. identified Mar 14, 2022, 12:23 PM UTC

    The team was able to restore the functionality of our Zero-Trust applications and admin console. We are still working on restoring Agent access to Perimeter 81 networks.

  8. identified Mar 14, 2022, 12:45 PM UTC

    We are gradually bringing up different components of the system to prevent abnormal load on our core infrastructure. Some services are already up, and we are working on enabling Agent access to Perimeter 81 networks.

  9. identified Mar 14, 2022, 01:32 PM UTC

    The team is still working on bringing up the remaining services, as some are still unable to communicate with our session authentication service.

  10. identified Mar 14, 2022, 02:11 PM UTC

    We are continuing to work on a fix for this issue.

  11. identified Mar 14, 2022, 02:48 PM UTC

    Our team is working to resolve this complex technical issue and restore functionality. We are rolling some of our components back to the last known working version to restore Agent access to Perimeter 81 networks.

  12. monitoring Mar 14, 2022, 02:58 PM UTC

    The team was able to revert some of our services to the last working configuration. All users should now be able to log in to the platform. We are now monitoring the system to verify full functionality for our entire user base.

  13. resolved Mar 15, 2022, 08:46 AM UTC

    This incident has been resolved.

  14. postmortem Mar 16, 2022, 02:21 PM UTC

    **Overview**

    The Perimeter 81 engineering team reported missing messages in the system due to a synchronization problem between two nodes of the RabbitMQ cluster. The engineering team investigated and found network connectivity failures to one of the RabbitMQ nodes, and decided to restart the problematic node in order to initiate a sync. While the RabbitMQ node was recovering, other micro-services failed to connect to it and performed an automated restart as part of our automatic recovery process. As a result, all users were disconnected and began reconnecting at once, which created a huge volume of messages across all of our systems and an influx of unhandled connections to the platform. At this point we realized that we needed to increase the number of instances to handle the load created by the simultaneous boot of the entire platform, and also to address the unhandled connections, which were placing a significant, ever-increasing load on the system. This required applying a hot-fix to our production code.

    **Root Cause Analysis**

    The incident was a combination of several significant failures that had to be identified and addressed simultaneously:

    * Unsynced RabbitMQ nodes caused by a network error.
    * An improperly performed reboot of one of the RabbitMQ nodes, which caused services to disconnect from the platform.
    * Inability to handle the load of a full-platform reboot sequence, due to insufficient resources and unaddressed unhandled connections.

    **Resolution and Corrective Actions**

    Our engineering team performed several actions to address this multi-layered set of failures:

    * Added resources to our core processes to better handle the significant load generated by the full-platform reboot.
    * Applied a hot-fix to our production code to address the accumulated number of connections.
    * Restarted the system micro-services so that they reconnected to the platform gracefully.
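The reconnection stampede described in the postmortem, where every service and user tries to reconnect the moment the broker comes back, is commonly mitigated on the client side with exponential backoff plus jitter. The Python sketch below illustrates that general technique only; `connect_to_broker` and the retry parameters are hypothetical stand-ins, not Perimeter 81's actual implementation.

```python
import random
import time

BASE_DELAY = 1.0    # seconds before the first retry
MAX_DELAY = 60.0    # cap so no retry waits longer than a minute
MAX_ATTEMPTS = 10   # give up (and alert) after this many tries


class BrokerUnavailable(Exception):
    """Raised by the stand-in connect call while the broker is down."""


def connect_to_broker():
    # Placeholder: a real service would open a connection to the
    # message broker (e.g. a RabbitMQ node) here and raise on failure.
    raise BrokerUnavailable("broker not reachable")


def connect_with_backoff():
    """Retry the connection with exponential backoff and full jitter."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return connect_to_broker()
        except BrokerUnavailable:
            # Exponential backoff: ceiling grows 1s, 2s, 4s, ... up to MAX_DELAY.
            ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            # Full jitter: sleep a random fraction of the ceiling, so a fleet
            # of restarting services spreads its retries over the window
            # instead of hammering the broker in the same instant.
            time.sleep(random.uniform(0, ceiling))
    raise RuntimeError("could not reach broker; escalate to on-call")
```

With full jitter, services that restart together desynchronize their retry schedules, which avoids exactly the kind of simultaneous-reconnect load spike the postmortem describes.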