Netop incident

connect.backdrop.cloud not loading

Netop experienced a notice incident on February 29, 2024, lasting 6h 19m and affecting Remote access services, Login, and other components. The incident has been resolved; the full update timeline is below.

Started: Feb 29, 2024, 10:55 AM UTC
Resolved: Feb 29, 2024, 05:15 PM UTC
Duration: 6h 19m
Detected by Pingoru: Feb 29, 2024, 10:55 AM UTC

Affected components

Remote access services, Login, Netop Portal frontend, OnDemand

Update timeline

investigating Feb 29, 2024, 10:55 AM UTC

We are currently investigating this issue.

identified Feb 29, 2024, 12:55 PM UTC

The issue has been identified and a fix is being implemented.

monitoring Feb 29, 2024, 02:23 PM UTC

A fix has been implemented and we are monitoring the results.

resolved Feb 29, 2024, 05:15 PM UTC

This incident has been resolved.

postmortem Mar 04, 2024, 09:18 AM UTC

A core component of the Connect system, which ordinarily runs in a paired-node configuration (two server nodes performing the same function), became unstable. While troubleshooting, we found that one of the nodes was low on memory and needed to be resized to allocate more. The resize requires the target node to be restarted, which takes it out of service for the duration of the restart, usually around 2 minutes. When the target node was restarted, the whole Connect platform failed, which should not have happened. We then found that the remaining node, which should have taken the load temporarily, did not have the service for this part of the system running.

Focus then shifted to getting the service running on the primary node, because the secondary node (the targeted node) could not start the service until it was able to communicate with the primary node. The service failed to start on the primary node because of corruption in one of the many message queues it is designed to process.

We have learnt from this incident that we need to add monitoring to our systems to better identify these types of failure. We have also noted that we need to add steps to our processes for this type of operation to check the health of all nodes running in pairs or sets, ensuring that they have the expected service running and can cope with temporary increases in load, so that this type of standard operation does not result in a catastrophic failure in future.

Whilst this incident is regrettable, we believe that the lessons learnt from it will help to strengthen our operations and contribute to the future stability of Impero Connect.
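
The pre-restart health check described in the postmortem might look roughly like the sketch below. It is illustrative only and does not represent Netop's actual tooling: `NodeStatus`, `restart_blockers`, and the simple load model (each node's load expressed as a fraction of its own capacity) are hypothetical names and assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    # Hypothetical snapshot of one node in a pair or set.
    name: str
    service_running: bool  # is the expected service up on this node?
    load_fraction: float   # current load as a fraction of capacity (0.0 to 1.0)


def restart_blockers(target: str, nodes: list[NodeStatus]) -> list[str]:
    """Return reasons why restarting `target` is unsafe; an empty list means proceed."""
    by_name = {n.name: n for n in nodes}
    if target not in by_name:
        return [f"unknown node: {target}"]

    # Every peer of the target must have the expected service running.
    peers = [n for n in nodes if n.name != target]
    blockers = [f"{p.name}: expected service is not running"
                for p in peers if not p.service_running]

    # The healthy peers must be able to absorb the target's load temporarily.
    healthy = [p for p in peers if p.service_running]
    if not healthy:
        blockers.append("no healthy peer would remain in service")
    else:
        spare = sum(1.0 - p.load_fraction for p in healthy)
        if spare < by_name[target].load_fraction:
            blockers.append("remaining peers lack capacity for the target's load")
    return blockers


if __name__ == "__main__":
    # Mirrors the incident: the primary's service was down when the secondary was restarted.
    pair = [NodeStatus("primary", False, 0.0), NodeStatus("secondary", True, 0.5)]
    for reason in restart_blockers("secondary", pair):
        print("BLOCKED:", reason)
```

Run against a state like the one in this incident, a check of this kind would report the stopped service on the primary node before the secondary node was taken out of service.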