Neo4j Aura incident

Increased Latency on Customer Metric Integration Requests

Minor · Resolved

Neo4j Aura experienced a minor incident on May 21, 2025 affecting AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io), AuraDB Professional on AWS (*.databases.neo4j.io), and eight other components, lasting 14h 17m. The incident has been resolved; the full update timeline is below.

Started
May 21, 2025, 06:51 PM UTC
Resolved
May 22, 2025, 09:09 AM UTC
Duration
14h 17m
Detected by Pingoru
May 21, 2025, 06:51 PM UTC

Affected components

AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io)
AuraDB Professional on AWS (*.databases.neo4j.io)
AuraDS on AWS (*.databases.neo4j.io)
AuraDS Enterprise on AWS (*.databases.neo4j.io)
AuraDB Business Critical (*.databases.neo4j.io) on AWS
AuraDB Virtual Dedicated Cloud on Azure (*.databases.neo4j.io)
AuraDB Professional on Azure (*.databases.neo4j.io)
AuraDS Enterprise on Azure (*.databases.neo4j.io)
AuraDS on Azure (*.databases.neo4j.io)
AuraDB Business Critical (*.databases.neo4j.io) on Azure

Update timeline

  1. investigating May 21, 2025, 01:40 PM UTC

    Engineering have identified an increase in latency on our Customer Metric Integration endpoints and are currently investigating the cause. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.
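As an illustration of the recommended workaround, here is a minimal Python sketch that passes an explicit 20-second timeout when issuing a Prometheus-style instant query. The endpoint URL, query path, and helper name are assumptions for illustration, not confirmed values; substitute those for your own instance.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical CMI query endpoint -- substitute the URL for your own instance.
CMI_ENDPOINT = "https://customer-metrics-api.neo4j.io/api/v1/query"

def build_query_request(promql: str) -> Request:
    """Build a GET request for a Prometheus-style instant query."""
    return Request(CMI_ENDPOINT + "?" + urlencode({"query": promql}))

# The workaround: raise the client timeout to 20 seconds instead of relying
# on a short default while the incident is ongoing.
# response = urlopen(build_query_request("up"), timeout=20)
```

The `timeout` argument to `urlopen` bounds how long the client waits for the server; raising it to 20 seconds gives the degraded endpoint time to respond instead of failing fast.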

  2. investigating May 21, 2025, 02:44 PM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  3. investigating May 21, 2025, 04:13 PM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  4. investigating May 21, 2025, 05:35 PM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  5. investigating May 21, 2025, 06:51 PM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  6. investigating May 21, 2025, 07:57 PM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  7. investigating May 21, 2025, 08:50 PM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  8. investigating May 21, 2025, 09:48 PM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  9. investigating May 21, 2025, 10:47 PM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  10. investigating May 21, 2025, 11:46 PM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  11. investigating May 22, 2025, 12:45 AM UTC

    Engineering are continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  12. investigating May 22, 2025, 04:30 AM UTC

    Our Engineering team is continuing to investigate the root cause of the issue. If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.

  13. monitoring May 22, 2025, 07:37 AM UTC

    The issue has been resolved, and we are currently monitoring the system to ensure continued stability. If you were experiencing timeouts while fetching metrics, those should now be resolved. We will continue to observe the system and provide further updates if necessary.

  14. resolved May 22, 2025, 09:09 AM UTC

    The latency issue affecting our Customer Metric Integration endpoints has been resolved. All services are now operating normally, and timeout errors should no longer occur.

  15. postmortem Jun 13, 2025, 10:11 AM UTC

    ### **What happened**

    At approximately 17:00 UTC on 2025-05-20, our cloud provider released and rolled out a change to the managed version of Prometheus we use to provide the Customer Metrics Integration (CMI) endpoint. This change affected our production PromQL query performance because "_the change to the PromQL query path now evaluates queries that previously had empty results_". This was a change we had no warning of and no control over, and it affected multiple customers. We quickly raised the issue with our cloud provider, and they rolled back the change. While we were checking the root cause of the issue, we immediately recommended increasing the timeout value to 20 seconds as a remediation.

    ### **How the service was affected**

    Customers with low timeout settings on their PromQL queries to fetch metrics from the Neo4j Aura CMI endpoint would see an increase in query timeouts (HTTP error 499).

    ### **What we are doing now**

    This incident was not caused by anything Neo4j directly controls, but we have been looking at improving our handling of this situation and have devised the following actions:

    * Provided feedback to our cloud provider on the impact this had on our service.
    * Added the CMI endpoint to the status page: Aura Customer Metrics ([customer-metrics-api.neo4j.io](http://customer-metrics-api.neo4j.io)) to better represent the status of the service.
    * Updated our documentation to recommend that a larger timeout be set.
    * Reviewed improvements to our detection of and alerting on query timeouts and errors (499 and 5xx errors).
    * Reviewed our own queries to make them more efficient and more resilient to a performance degradation in our supplier's service.
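The resilience point in the postmortem's last action item can be sketched generically: a client that treats HTTP 499 and 5xx responses as transient and backs off exponentially, capped at the 20-second ceiling the status updates recommended. This is an illustrative pattern under those assumptions, not part of Neo4j's stated remediation, and the helper names are hypothetical.

```python
def is_retryable(status: int) -> bool:
    # HTTP 499 (client closed the request, typically a timeout at a proxy)
    # and 5xx server errors are transient in this scenario; other 4xx
    # client errors are not worth retrying.
    return status == 499 or 500 <= status <= 599

def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 20.0) -> float:
    # Exponential backoff, capped at the 20-second timeout ceiling
    # recommended during the incident.
    return min(cap, base * (2 ** attempt))
```

Capping the backoff keeps worst-case wait times bounded while still spacing out retries enough to avoid hammering a degraded endpoint.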