Neo4j Aura incident

Performance degradation - High CPU load on Aura 5

Neo4j Aura experienced a minor incident on March 14, 2024 affecting AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io) and AuraDB Professional on AWS (*.databases.neo4j.io), lasting 18h 58m. The incident has been resolved; the full update timeline is below.

Started: Mar 14, 2024, 08:24 PM UTC
Resolved: Mar 15, 2024, 03:23 PM UTC
Duration: 18h 58m
Detected by Pingoru: Mar 14, 2024, 08:24 PM UTC

Affected components

AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io)
AuraDB Professional on AWS (*.databases.neo4j.io)

Update timeline

identified Mar 14, 2024, 08:24 PM UTC

Some customers running Aura 5 may experience higher-than-usual CPU usage on their instances. We have identified the root cause and are actively working on a fix and preparing its rollout. Meanwhile, please be assured that we are actively monitoring instances and taking mitigating actions.
identified Mar 14, 2024, 10:03 PM UTC

Our engineers continue working on a fix that will be rolled out when it is ready. In the meantime, please be assured that we're actively monitoring instances and taking mitigating actions.
identified Mar 15, 2024, 12:13 AM UTC

Our engineers are continuing to work on a fix. We will continue actively monitoring instances and taking any mitigation actions as needed.
identified Mar 15, 2024, 02:30 AM UTC

Our engineers continue working on a fix that will be rolled out when it is ready. In the meantime, please be assured that we're actively monitoring instances and taking mitigating actions.
identified Mar 15, 2024, 04:34 AM UTC

Our engineers are continuing to work on a fix. We will continue actively monitoring instances and taking any mitigation actions as needed.
identified Mar 15, 2024, 06:35 AM UTC

Our engineers continue working on a fix that will be rolled out when it is ready. In the meantime, please be assured that we're actively monitoring instances and taking mitigating actions.
identified Mar 15, 2024, 08:50 AM UTC

We have a fix ready and will be rolling it out today. We will continue actively monitoring instances and taking any mitigation actions as needed.
identified Mar 15, 2024, 11:07 AM UTC

The fix is currently being rolled out and will be completed today. We will continue actively monitoring instances and taking any mitigation actions as needed.
identified Mar 15, 2024, 01:00 PM UTC

We are continuing to work on a fix for this issue.
identified Mar 15, 2024, 01:03 PM UTC

We are progressing through the rollout of the fix and have now completed the Aura Professional tier. We are on course to finish today. We continue to monitor and will take proactive mitigating action as necessary until the fix is fully rolled out.
monitoring Mar 15, 2024, 01:42 PM UTC

The fix has been rolled out and we will now be monitoring.
resolved Mar 15, 2024, 03:23 PM UTC

The roll out of the fix is complete and the service is fully restored.
postmortem Jun 04, 2024, 11:12 AM UTC

### **What happened**

A change in how we call the RAFT resolver (clustering protocol), made as part of how single node instances are managed, resulted in a large increase in internal API calls. This exposed a latent memory leak, which in turn forced Java garbage collection to run at high intensity, consuming valuable CPU resources.

### **How the service was affected**

The issue only affected single node instances on the AuraDB Free tier. Users would notice performance issues with queries or operations requiring CPU resources.

### **What we are doing now**

* Considering running changes through a soak period to allow better detection of slow memory leaks.
* Reviewing how we can better detect these conditions while the service is running, and better detect patterns of CPU usage issues.
* Improving internal handling of early warning signs from alarms and making better impact assessments.
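
As an illustration only, the soak-period and CPU-pattern ideas in the postmortem could be approximated with two simple checks: a trend fit over heap usage measured after each garbage collection, and the share of wall-clock time spent in garbage collection. This is a minimal sketch and not Neo4j's actual tooling. The function names, the sample format (hours, MiB), and the 1 MiB/hour threshold are all assumptions made for the example.

```python
from statistics import mean


def leak_slope(samples):
    """Least-squares slope of heap-after-GC (MiB) against time (hours).

    `samples` is a list of (hours_since_start, heap_after_gc_mib) pairs.
    The sample format is hypothetical, chosen for this sketch.
    """
    times = [t for t, _ in samples]
    heaps = [h for _, h in samples]
    t_bar, h_bar = mean(times), mean(heaps)
    denom = sum((t - t_bar) ** 2 for t in times)
    if denom == 0:
        raise ValueError("need samples at two or more distinct times")
    return sum((t - t_bar) * (h - h_bar) for t, h in samples) / denom


def looks_like_leak(samples, max_slope_mib_per_hour=1.0):
    """Flag a soak run whose post-GC heap keeps growing.

    The default threshold is an arbitrary placeholder, not a recommended value.
    """
    return leak_slope(samples) > max_slope_mib_per_hour


def gc_overhead(gc_ms_start, gc_ms_end, wall_ms):
    """Fraction of a wall-clock window spent in garbage collection.

    Inputs are cumulative GC time readings (ms) at the start and end of the
    window, plus the window length (ms).
    """
    if wall_ms <= 0:
        raise ValueError("wall_ms must be positive")
    return (gc_ms_end - gc_ms_start) / wall_ms
```

A soak run would collect post-GC heap readings over several hours and pass them to `looks_like_leak`. A steadily rising `gc_overhead` across consecutive windows would be one possible early sign of the CPU pattern described above. On a JVM, cumulative GC time can be read from `GarbageCollectorMXBean.getCollectionTime()`.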