Neo4j Aura incident
Endpoint ingress degradation in AWS us-east-1, us-west-2 and ap-northeast-2
Neo4j Aura experienced a major incident on January 16, 2025 affecting AuraDB Professional on AWS (*.databases.neo4j.io) and AuraDS on AWS (*.databases.neo4j.io) and 1 more component, lasting 4h 20m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- identified Jan 16, 2025, 03:56 PM UTC
Endpoint ingress degradation in AWS us-east-1, us-west-2 and ap-northeast-2 was introduced earlier on January 16, 2025. We have identified a fix and are currently deploying it to all impacted regions.
- monitoring Jan 16, 2025, 06:48 PM UTC
A fix has been deployed across all impacted regions. This issue has been resolved and we will monitor to ensure the service remains healthy before considering this Resolved.
- resolved Jan 16, 2025, 08:16 PM UTC
The incident is resolved. A postmortem will be available once it is complete.
- postmortem Feb 21, 2025, 01:40 PM UTC
## **What happened** A configuration change in Aura’s DB Ingress service caused intermittent connectivity issues for customer databases across multiple AWS regions. The issue was due to a misconfiguration of AWS Network Load Balancers \(NLBs\), which resulted in dropped inbound traffic. A change focused on making the service more efficient rolled on 2025-01-15 and reduced db-ingress replicas to three per region, affecting AWS regions with more than three availability zones \(AZs\). The AWS Network Load Balancer \(NLB\) couldn't route traffic correctly when requests landed in AZs without a db-ingress pod. Cross-zone load balancing was not enabled, preventing the NLB from distributing traffic across zones. This led to intermittent connection failures in us-east-1, us-west-2, and ap-northeast-2. On 2025-01-16 the fix was deployed. ## **How the service was affected** Intermittent connectivity failures impacted Aura Professional, Business Critical, and DS Enterprise orchestras. The issue was caused by traffic being dropped by the AWS Network Load Balancer \(NLB\) due to improper routing. As a result, failure rates \(for queries using the Bolt protocol\) reached approximately 40% in us-east-1, 25% in us-west-2, and 25% in ap-northeast-2. We reverted a change that reduced the number of db-ingress replicas, ensuring that instances were running in all availability zones and this restored normal database operations. ## **What we are doing now** Neo4j remains committed to providing reliable service and is implementing additional safeguards to prevent similar incidents in the future. To prevent similar incidents in the future, we are implementing the following improvements: * **Enable cross-zone Load Balancing**: Ensuring traffic is correctly distributed across all availability zones. * **Adding automated monitoring** to detect ingress failures before deployment. * **Adding improved Alarms** to detect and respond to connectivity issues quickly.