Nango incident

Degradation in sync executions

Minor · Resolved
Started
Apr 12, 2026, 08:00 AM UTC
Resolved
Apr 14, 2026, 08:00 AM UTC
Duration
2d
Detected by Pingoru
Apr 12, 2026, 08:00 AM UTC

Affected components

Nango Cloud Health

Update timeline

  1. investigating Apr 12, 2026, 08:00 AM UTC

    Syncs are still delayed, while actions appear unaffected. The issue seems to be related to how the database is handling sync schedules. Once sync processing recovers, synced data will catch up automatically.

  2. resolved Apr 14, 2026, 08:00 AM UTC

    Post-Incident Summary

    Date: 12 April 2026
    Impact: Degraded sync execution; delayed actions and webhook processing
    Status: Resolved

    Summary

    A webhook flood originating from a single customer environment caused one of our databases to saturate, resulting in broad degradation of asynchronous job processing. Sync execution dropped to near zero, and a large portion of actions and webhook-driven work was delayed or unable to run. A secondary bug in the scheduling system amplified the incident and blocked two consecutive recovery attempts before a fix was deployed.

    Timeline (UTC)

    Issue began: 07:00
    Detected by monitoring: 07:00
    Status page updated: 07:00
    Mitigated: 15:30
    Resolved: 15:55

    Root Cause

    A single customer environment generated a sustained webhook flood, well above the typical baseline. Each incoming webhook triggered a database query to check the current queue depth for that customer's group before deciding whether to admit a new task. Under flood conditions, this query saturated the CPU of one of our databases, preventing other work (including syncs and actions from all customers) from being scheduled or processed. Once the per-group queue cap was reached, new work could no longer be enqueued, and the system remained effectively stalled.

    Recovery was complicated by a separate bug in the recurring-schedule path. When the scheduler encountered a group that had already hit the queue cap, an error in the code caused the exception to be swallowed silently. As a result, affected schedules were never marked as processed and were retried on every scheduler tick, adding further load to an already saturated database. This caused two consecutive recovery attempts to fail.

    Resolution

    A fix was deployed to correct the scheduling bug, ensuring that capped groups are handled correctly and schedules are properly advanced after each pass. Task execution times were shifted forward in bulk to drain pressure from the database, then restored in batches. Once the backlog cleared, the system returned to a healthy state and full processing resumed by 15:55 UTC.

    Follow-Up Actions

    System safeguards:

    • Improve the current per-enqueue queue-depth admission-control mechanism to reduce database load under flood conditions.
    • Define a rate-limiting and load-shedding strategy for webhook ingestion to protect the platform when a single customer generates sustained enqueue pressure.
    • Fix the scheduling bug to correctly handle capped groups without silent failures (completed).

    Illustrative sketches of the admission-control check and the scheduler fix appear after this timeline.
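
To make the root cause concrete, the admission-control check can be pictured roughly as below: one queue-depth query against the database for every incoming webhook, with the group stalling once its cap is reached. This is a minimal TypeScript sketch that assumes a Postgres "tasks" table; the table, column names, cap value, and function names are illustrative and do not come from Nango's codebase.

    import { Pool } from 'pg';

    const GROUP_QUEUE_CAP = 10_000; // assumed per-group cap, not the real value
    const pool = new Pool();        // connection settings come from PG* env vars

    // One COUNT query per incoming webhook. Under a sustained flood from a
    // single customer, this query alone can saturate the database CPU and
    // starve scheduling work for all other customers.
    async function getQueueDepth(groupId: string): Promise<number> {
      const { rows } = await pool.query(
        "SELECT COUNT(*)::int AS depth FROM tasks WHERE group_id = $1 AND state = 'queued'",
        [groupId],
      );
      return rows[0].depth;
    }

    async function admitWebhook(groupId: string, payload: unknown): Promise<boolean> {
      const depth = await getQueueDepth(groupId);
      if (depth >= GROUP_QUEUE_CAP) {
        // Cap reached: new work for this group is rejected until the backlog drains.
        return false;
      }
      await pool.query(
        "INSERT INTO tasks (group_id, payload, state) VALUES ($1, $2, 'queued')",
        [groupId, JSON.stringify(payload)],
      );
      return true;
    }

Moving this check off the per-enqueue hot path, for example onto a cached or approximate counter, is one way the first follow-up action could be addressed; the summary does not specify which approach Nango will take.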

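The scheduler bug and its fix can be sketched in the same spirit: in the buggy shape, the "group is at its cap" error is swallowed and the schedule is never marked as processed, so it is retried on every tick; in the fixed shape, a capped group is treated as an expected outcome and the schedule is still advanced. All names below are placeholders, not Nango's actual scheduler code.

    interface Schedule {
      id: string;
      groupId: string;
    }

    class GroupCappedError extends Error {}

    // Stand-in for the real enqueue path: assume it throws GroupCappedError
    // when the group's per-group queue cap has already been reached.
    async function createTaskForSchedule(schedule: Schedule): Promise<void> {
      throw new GroupCappedError(`group ${schedule.groupId} is at its queue cap`);
    }

    // Stand-in for advancing the schedule so it is not picked up again on the
    // next scheduler tick.
    async function markProcessed(scheduleId: string): Promise<void> {
      console.log(`schedule ${scheduleId} advanced`);
    }

    // Buggy shape: the error is swallowed and markProcessed is never reached,
    // so the schedule stays "unprocessed" and is retried on every tick.
    async function processScheduleBuggy(schedule: Schedule): Promise<void> {
      try {
        await createTaskForSchedule(schedule);
        await markProcessed(schedule.id);
      } catch {
        // silently swallowed
      }
    }

    // Fixed shape: a capped group is an expected outcome, not a failure, so
    // the schedule is always advanced after each pass.
    async function processScheduleFixed(schedule: Schedule): Promise<void> {
      try {
        await createTaskForSchedule(schedule);
      } catch (err) {
        if (!(err instanceof GroupCappedError)) throw err; // real errors still surface
      }
      await markProcessed(schedule.id);
    }

    // Example: the fixed path still advances a capped schedule.
    processScheduleFixed({ id: 'sched-1', groupId: 'customer-42' }).catch(console.error);
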
Looking to track Nango downtime and outages?

Pingoru polls Nango's status page every 5 minutes and alerts you the moment it reports an issue — before your customers do.

  • Real-time alerts when Nango reports an incident
  • Email, Slack, Discord, Microsoft Teams, and webhook notifications
  • Track Nango alongside 5,000+ providers in one dashboard
  • Component-level filtering
  • Notification groups + maintenance calendar
Start monitoring Nango for free

5 free monitors · No credit card required