Neo4j Aura incident

AuraDB instances on AWS affected by unavailability

Neo4j Aura experienced a major incident on April 29, 2024, affecting AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io), AuraDB Professional on AWS (*.databases.neo4j.io), and 1 more component, lasting 3h 35m. The incident has been resolved; the full update timeline is below.

Started: Apr 29, 2024, 08:57 AM UTC
Resolved: Apr 29, 2024, 12:33 PM UTC
Duration: 3h 35m
Detected by Pingoru: Apr 29, 2024, 08:57 AM UTC

Affected components

AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io)
AuraDB Professional on AWS (*.databases.neo4j.io)
AWS ec2-ap-south-1
AWS eks-ap-south-1
AWS s3-ap-south-1
AWS ec2-ap-southeast-2
AWS eks-ap-southeast-2
AWS s3-ap-southeast-2
AWS s3-ca-central-1
AWS ec2-ca-central-1

Update timeline

monitoring Apr 30, 2024, 03:57 PM UTC

Some AuraDB instances running on AWS temporarily lost availability before recovering automatically.
resolved Apr 30, 2024, 03:58 PM UTC

This incident is now resolved.
postmortem Apr 30, 2024, 03:59 PM UTC

## **What happened**

On 2024-04-29 at 08:57:35 UTC, Neo4j implemented a code update that had unintended consequences for our infrastructure management. The aforementioned code introduced a bug in our cluster management tool, which became apparent when an update to our Kubernetes clusters resulted in the termination of VMs hosting certain databases. While the system self-healed, this led to instances becoming temporarily unavailable for some AWS customers. Since our code rollout occurs progressively across environments, the impact was staggered across AWS service tiers.

## **How the service was affected**

Some of our AWS customers (running instances up to 32GB of RAM) experienced a <10 minute service interruption between 2024-04-29 08:57:35 UTC and 2024-04-29 12:33:40 UTC when the last batch of impacted databases completed the self-healing process. The impact occurred during multiple time intervals. Each interval affected a different group of instances, resulting in shorter recovery times for individual instances.

This occurred in the following time intervals as the roll out was staggered:

* 08:57:35 - 09:09:50 UTC
* 09:11:55 - 09:25:05 UTC
* 09:37:05 - 09:48:55 UTC
* 10:08:35 - 10:20:50 UTC
* 11:13:00 - 11:27:10 UTC
* 12:10:20 - 12:23:40 UTC
* 12:19:30 - 12:33:40 UTC

## **What we are doing now**

After conducting a comprehensive analysis of the situation, we are taking decisive actions to prevent such incidents from occurring in the future. Our efforts fall into two key areas:

##### **Immediate Actions:**

* Enhancing monitoring and alert systems for the underlying Aura infrastructure, with a focus on detecting and responding to individual (or small groups of) database unavailability promptly.
* Strengthening testing protocols and code review processes to identify and address bugs in the components managing Cloud Infrastructure at an earlier stage.

These measures are aimed at ensuring the reliability and stability of our services moving forward.
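
For readers who want to check the timing arithmetic, the staggered impact windows listed in the postmortem can be tallied with a short script. This is only an illustrative sketch: the timestamps are copied from the list above, while the names `WINDOWS`, `parse`, `window_lengths`, and `overall_span` are hypothetical helpers introduced here and are not part of any Neo4j tooling.

```python
from datetime import datetime, timedelta

# Impact windows (UTC, 2024-04-29), copied from the postmortem list.
WINDOWS = [
    ("08:57:35", "09:09:50"),
    ("09:11:55", "09:25:05"),
    ("09:37:05", "09:48:55"),
    ("10:08:35", "10:20:50"),
    ("11:13:00", "11:27:10"),
    ("12:10:20", "12:23:40"),
    ("12:19:30", "12:33:40"),
]


def parse(clock: str) -> datetime:
    """Parse an HH:MM:SS clock time on the incident date."""
    return datetime.strptime(f"2024-04-29 {clock}", "%Y-%m-%d %H:%M:%S")


def window_lengths(windows):
    """Length of each impact window as a timedelta."""
    return [parse(end) - parse(start) for start, end in windows]


def overall_span(windows):
    """Time from the earliest window start to the latest window end."""
    starts = [parse(start) for start, _ in windows]
    ends = [parse(end) for _, end in windows]
    return max(ends) - min(starts)


if __name__ == "__main__":
    for (start, end), length in zip(WINDOWS, window_lengths(WINDOWS)):
        print(f"{start} - {end} UTC: {length}")
    print("overall span:", overall_span(WINDOWS))
```

Each window describes a batch of instances, not a single database, so a window's length is an upper bound on how long that group was affected, not a per-instance outage figure. Computed from the second-level timestamps, the span from the first window's start to the last window's end is 3:36:05, roughly in line with the duration reported at the top of this page.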