Neo4j Aura incident
Customer Metrics Integration (CMI) unavailable
Neo4j Aura experienced a minor incident on January 15, 2025, affecting AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io), AuraDS Enterprise on AWS (*.databases.neo4j.io), and 1 more component, lasting 3h 57m. The incident has been resolved; the full update timeline is below.
Update timeline
- identified Jan 15, 2025, 06:20 PM UTC
The Neo4j Aura Customer Metrics Integration (CMI) is currently unavailable. We have identified a fix and are preparing to roll it out.
- monitoring Jan 15, 2025, 07:39 PM UTC
A fix has been rolled out and we are monitoring to ensure CMI is fully operational for all instances.
- resolved Jan 15, 2025, 10:17 PM UTC
A fix has been in place for some time and this incident is considered resolved. A postmortem will be forthcoming.
- postmortem Feb 07, 2025, 11:45 AM UTC
### **What happened**

On 2025-01-15 at 15:42:02 UTC, our secure endpoint for Customer Metrics Integration became unavailable and returned the error {"message":"Failed to validate JWT."} due to the expiry of the SSL certificate. Whilst we had renewed the certificate, our framework for deploying and rolling out components had updated the service accounts but had not updated the associated service key secret in the correct sequence.

### **How the service was affected**

Customers collecting and ingesting metrics from their Neo4j Aura instances were no longer able to do so because of connectivity issues with the provided endpoints. Connecting to [customer-metrics-api.neo4j.io](http://customer-metrics-api.neo4j.io/) requires valid encryption (and an up-to-date certificate). We detected the issue internally when rolling out an update and soon afterwards received reports of issues from customers. We worked to create a new service account and associated secret key to be rolled out immediately.

### **What we are doing now**

We recognise that this issue caused serious problems in monitoring and operating Neo4j Aura instances, and we have committed to the following actions:

* Monitoring: We built additional monitoring metrics and dashboards, with derived alarms routed to our cloud operations team, to detect failed connections.
* Mitigation: We are improving how we roll out changes related to service account secret keys.
* Prevention: We will no longer bundle the deletion of a service account secret key with its replacement by a new one; the tasks associated with these changes will be split.