Elium incident

Load on one of our Outscale K8S cluster nodes

Minor Resolved

Elium experienced a minor incident on July 7, 2021 affecting Private Hosting, lasting 22h 55m. The incident has been resolved; the full update timeline is below.

Started: Jul 07, 2021, 05:20 PM UTC
Resolved: Jul 08, 2021, 04:16 PM UTC
Duration: 22h 55m
Detected by Pingoru: Jul 07, 2021, 05:20 PM UTC

Affected components

Private Hosting

Update timeline

  1. investigating Jul 07, 2021, 05:20 PM UTC

    We have detected an abnormal load on one of the nodes of our Outscale Kubernetes cluster. We had to restart it.

  2. investigating Jul 07, 2021, 05:22 PM UTC

    Restarting the node solved the load problem. We are still checking why this load occurred. Currently, the services are working properly again.

  3. identified Jul 07, 2021, 08:41 PM UTC

    During rolling updates, restarting containers on the node produces timeouts.

  4. identified Jul 08, 2021, 08:27 AM UTC

    We are trying to solve the node load problem. It slows down the customer instances hosted on our private hosting (Outscale) whenever services restart on the node.

  5. identified Jul 08, 2021, 08:46 AM UTC

    We completely recreated the node and redeployed the services. The load continues to increase abnormally and this impacts the customer instances. We have therefore, once again, disabled the services on this node.

  6. identified Jul 08, 2021, 09:34 AM UTC

    We are still testing different configurations for the faulty node (a different kernel version, creating another node).

  7. monitoring Jul 08, 2021, 12:11 PM UTC

    We have created a new node using different hardware specifications (CPU type). After several tests, we found that the abnormal load problem no longer occurs on this type of machine. We continue to monitor the behaviour of this node. At the same time, we are reporting our findings to 3DS Outscale support to confirm that the problem comes from the type of machine used for the original node.

  8. resolved Jul 08, 2021, 04:16 PM UTC

    We performed several tests (including the deployment of a new version of the Elium services) to validate that the new node is stable.