Nango incident

Nango experienced a major incident on October 6, 2025 affecting Nango Cloud Health, lasting 15h 40m. The incident has been resolved; the full update timeline is below.

Started: Oct 06, 2025, 05:22 PM UTC
Resolved: Oct 07, 2025, 09:02 AM UTC
Duration: 15h 40m
Detected by Pingoru: Oct 06, 2025, 05:22 PM UTC

Affected components

Nango Cloud Health

Update timeline

investigating Oct 06, 2025, 05:22 PM UTC

Nango is experiencing problems and we are investigating.

resolved Oct 07, 2025, 09:02 AM UTC

Post-Incident Summary: Database Lock Contention and Service Outage

Date: 6 October 2025
Impact: Temporary outage of Nango services
Status: Resolved

Summary

A sustained, high-volume workload from a downstream integration generated a large number of webhook events over an extended period. The follow-on fetches (getRecords) executed concurrently at scale, and a small update in that code path (persisting a “last fetched” timestamp) created heavy database contention. Lock saturation and connection exhaustion led to elevated latencies and 499 responses across public APIs until mitigations were applied.

Timeline (CET)

- Issue began: 19:21
- Detected by monitoring: 19:37
- Mitigated: 20:35
- Fully resolved: 23:17

Root Cause

The getRecords flow includes a database write to persist a “last fetched” timestamp. Under extreme concurrency, these updates serialized and waited on one another, creating widespread lock contention. As locks piled up, available connections were exhausted, which introduced long response times and impacted all Nango services.

Resolution

- Logs and metrics showed elevated lock waits and deadlock messages (over 18,000 waiting operations at peak).
- High-load webhook sources were temporarily disabled to shed database work, after which services recovered to normal.
- The “last fetched” update in getRecords was determined to be unnecessary and removed to eliminate this contention pattern.

All systems were fully operational by 20:35 CET.

Follow-Up Actions

- Rate limiting on webhook endpoints.
- More controls to limit the blast-radius of isolated single-tenant spikes.
- Enhanced observability for lock growth and connection-pool saturation to enable earlier, automated mitigation.
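
The first two follow-up actions (rate limiting on webhook endpoints and containing single-tenant spikes) could be approached with a per-tenant token bucket. The sketch below is illustrative only and is not Nango's implementation: the names `TokenBucket`, `PerTenantLimiter`, and `tenant_id`, as well as the capacity and refill values, are assumptions made for the example. It keeps state in memory, so a deployment with several API instances would need a shared store instead.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec` tokens per second."""
    capacity: float
    refill_per_sec: float
    tokens: float
    updated_at: float

    def try_consume(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerTenantLimiter:
    """Keeps one bucket per tenant so one noisy source cannot use up shared capacity."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_sec
        self._clock = clock  # injectable so the behaviour can be tested deterministically
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            # New tenants start with a full bucket.
            bucket = TokenBucket(self._capacity, self._refill, self._capacity, now)
            self._buckets[tenant_id] = bucket
        return bucket.try_consume(now)


# Usage with a fake clock (hypothetical tenant names and limits).
fake_now = [0.0]
limiter = PerTenantLimiter(capacity=2, refill_per_sec=1, clock=lambda: fake_now[0])
burst = [limiter.allow("tenant-a") for _ in range(3)]  # third call exceeds the burst
other = limiter.allow("tenant-b")                      # a different tenant is unaffected
fake_now[0] = 1.0
after_refill = limiter.allow("tenant-a")               # one token has been refilled
```

In a webhook handler, a rejected call would typically be answered with HTTP 429 before any database work is scheduled, which is the point at which load shedding helps with the kind of contention described above.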