Neo4j Aura incident

A small number of instances are not accepting write queries

Minor · Resolved

Neo4j Aura experienced a minor incident on September 24, 2024, lasting 4h 21m. It affected AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io), AuraDB Business Critical (*.databases.neo4j.io) on AWS, and the other components listed below. The incident has been resolved; the full update timeline is below.

Started: Sep 24, 2024, 10:20 AM UTC
Resolved: Sep 24, 2024, 02:41 PM UTC
Duration: 4h 21m
Detected by Pingoru: Sep 24, 2024, 10:20 AM UTC

Affected components

AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io)
AuraDB Business Critical (*.databases.neo4j.io) on AWS
AuraDB Virtual Dedicated Cloud on Azure (*.databases.neo4j.io)
AuraDB Business Critical (*.databases.neo4j.io) on Azure
AuraDB Free (*.databases.neo4j.io)

Update timeline

  1. investigating Sep 24, 2024, 10:20 AM UTC

    We are investigating an issue with some instances not accepting write queries.

  2. identified Sep 24, 2024, 10:38 AM UTC

    We have identified the issue and are working on rolling out a fix.

  3. identified Sep 24, 2024, 12:35 PM UTC

    A fix is now being rolled out to production.

  4. resolved Sep 24, 2024, 02:41 PM UTC

    We have rolled out a fix and confirmed the full recovery of features and complete resolution with affected customers.

  5. postmortem Oct 01, 2024, 09:15 AM UTC

    ### **What happened**

    We rolled out the latest release of the database on Neo4j Aura. During the rollout, a small number of database instances switched to read-only mode because the out-of-disk protection was triggered erroneously.

    ### **How the service was affected**

    Affected database instances were placed in read-only mode (serving read queries only) while still displaying an online status. We were notified of the issue by customers.

    Although we monitor disk usage (to help prevent data corruption), we did not detect this issue ourselves. It was triggered by some backup-restore component containers running into an out-of-memory condition (2024-09-24 at 2:10 UTC), which left them unable to serve disk metrics. If the operator component cannot read metrics, it falls back to using estimated values. The resulting estimates triggered the safeguard, which placed the small number of affected instances in read-only mode.

    Working with our engineering teams, we quickly identified that adjusting the memory setting of the backup-restore component serving the disk metrics would allow new cluster members to start successfully. A fix was released, tested in lower environments, and rolled out to all affected instances (2024-09-24 at 15:10 UTC) and then to the whole service.

    ### **What we are doing now**

    This was an extreme case of a safety feature (out-of-disk protection) causing an issue. We immediately fixed the issue by preventing the backup-restore component from running out of memory under the same conditions. We believe this warrants a number of further changes, which we are carrying out to better prevent, detect, and mitigate issues affecting the out-of-disk protection feature:

    * Fixing the issue: implement a fix for the out-of-disk protection for the case where it cannot receive any metrics.
    * Aura console instance status display: implement a change to reflect the instance's read-only mode.
    * Detection: implement an alert to detect a surge of instances going out of disk.
    * Prevention: implement an alert on the rate of out-of-memory (OOM) events for the component classes involved in the out-of-disk (OOD) protection decision chain.
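
The failure mode described above (missing metrics replaced by estimates, which then trip a safeguard) and the planned surge alert can be illustrated with a minimal Python sketch. This is not Neo4j's implementation: the names `DiskReading`, `decide`, and `surge_detected`, the 95% threshold, and the 10-minute window are all hypothetical, chosen only to show one possible policy in which a missing measurement raises an alert instead of driving the read-only decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Action(Enum):
    KEEP_READ_WRITE = "keep_read_write"
    SET_READ_ONLY = "set_read_only"
    ALERT_METRICS_MISSING = "alert_metrics_missing"


@dataclass
class DiskReading:
    # Fraction of disk in use (0.0 to 1.0), or None when the metrics
    # source could not be reached (for example, after an OOM kill).
    used_fraction: Optional[float]


# Hypothetical threshold; the real value is not stated in the postmortem.
READ_ONLY_THRESHOLD = 0.95


def decide(reading: DiskReading) -> Action:
    """Choose an action from a measured reading only, never from an estimate."""
    if reading.used_fraction is None:
        # No measurement available: raise an alert rather than substituting
        # a guessed value that might trip the safeguard by mistake.
        return Action.ALERT_METRICS_MISSING
    if reading.used_fraction >= READ_ONLY_THRESHOLD:
        return Action.SET_READ_ONLY
    return Action.KEEP_READ_WRITE


def surge_detected(
    read_only_events: Sequence[float],
    now: float,
    window_seconds: float = 600.0,  # assumed 10-minute window
    max_events: int = 3,            # assumed tolerance before alerting
) -> bool:
    """Return True when more than max_events instances went read-only in the window.

    read_only_events holds the timestamps (in seconds) at which instances
    entered read-only mode.
    """
    recent = [t for t in read_only_events if 0 <= now - t <= window_seconds]
    return len(recent) > max_events
```

Treating "no metrics" as its own outcome is a trade-off rather than a clear win: it avoids false read-only transitions like the ones in this incident, but it also means a genuinely full disk would go unprotected until metrics return, so the alert must be acted on quickly.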