Nango incident

2026-06-11 Incident post-mo...

Major, Resolved

Nango reported a major incident on June 16, 2026 affecting Nango Cloud Health (the post-mortem below dates the underlying outage to June 11-12, 2026). The incident has been resolved; the full update timeline is below.

Started
Jun 16, 2026, 06:10 PM UTC
Resolved
Jun 16, 2026, 06:10 PM UTC
Detected by Pingoru
Jun 16, 2026, 06:10 PM UTC

Affected components

Nango Cloud Health

Update timeline

  1. resolved Jun 16, 2026, 06:10 PM UTC

    2026-06-11 Incident post-mortem

    Summary

    On June 11, 2026, our jobs service (which runs syncs, actions, and webhooks) became progressively overloaded and then unavailable for a period of roughly five hours, with customer impact concentrated in the final ~80 minutes. During this window, actions and syncs were slow, some timed out (failed), and queued work built up before being processed. No data was lost; delayed tasks were retried and worked off once the service recovered. We have fully mitigated the incident and have a concrete plan to prevent recurrence.

    Customer impact

    What: Elevated latency on actions and syncs. Some actions and syncs timed out, and webhook and sync processing was delayed as a backlog formed.
    When: June 11, 2026, from approximately 21:20 UTC (gradual degradation) to 02:03 UTC on June 12 (mitigated). The heaviest impact was between ~00:40 and ~02:03 UTC.
    Data integrity: No data was lost. Delayed work was queued and processed once the service recovered.

    What happened

    Our processing service was sized with very little spare capacity relative to its normal load. At about 20:00 UTC a first instance stopped responding because it was overloaded. This triggered escalating crash-and-recover waves: when the service came back, each instance immediately picked up the full incoming workload plus the work that had queued during the restart. Because there was little headroom, the instances ran out of CPU faster than they could work through the backlog. Overloaded instances failed their health checks and were automatically restarted by our infrastructure, which returned them to the same overloaded starting point. This created a self-reinforcing loop: each restart left more work queued, which made the next start-up harder, until the service could no longer keep up.

    Resolution

    We broke the loop by significantly increasing the processing capacity available to the service (raising its scaling ceiling). With more instances sharing the load, each one had enough CPU headroom to pass its health checks, stay running, and work through the backlog. The service returned to normal latency and the queued work was fully processed.

    What we're doing to prevent recurrence

    Right-sizing the service: increasing its CPU allocation and lowering the threshold at which it scales out, so it always runs with meaningful headroom. We are reviewing our other services for the same pattern.
    Earlier detection: adding monitoring on the leading indicators (runtime/CPU health, unavailable instances, and scaling-at-maximum) so we catch this class of degradation well before it affects customers.

    We apologize for the disruption. If you have questions about how this affected you, please reach out.
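The restart loop described under "What happened" and the effect of adding capacity described under "Resolution" can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Nango's actual scheduler or health-check logic: the `Fleet` type, the `simulate` function, the tick-based model, and every number below are hypothetical and chosen only to show how a fleet with little headroom can fail to recover from a backlog that a larger fleet drains easily.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    """Hypothetical model of a pool of identical worker instances."""
    instances: int          # number of running instances
    capacity: float         # work units one instance can process per tick
    overload_factor: float  # pending work per instance, as a multiple of
                            # capacity, above which the health check fails
    restart_ticks: int      # ticks an instance stays down after a failed check


def simulate(fleet: Fleet, arrivals: float, ticks: int,
             initial_backlog: float = 0.0) -> tuple[list[float], int]:
    """Return the backlog after each tick and the number of restart waves.

    Simplifying assumption: all instances behave identically, so the whole
    fleet either processes work or restarts together.
    """
    backlog = initial_backlog
    down_for = 0   # ticks remaining until the fleet is back up
    restarts = 0
    history: list[float] = []

    for _ in range(ticks):
        backlog += arrivals  # new work keeps arriving even while instances are down

        if down_for > 0:
            # Still restarting: nothing is processed, the queue just grows.
            down_for -= 1
        else:
            per_instance = backlog / fleet.instances
            if per_instance > fleet.overload_factor * fleet.capacity:
                # Too much pending work per instance: the health check fails
                # and the instances are restarted without making progress.
                restarts += 1
                down_for = fleet.restart_ticks
            else:
                # Healthy: process up to the fleet's total capacity.
                backlog -= min(backlog, fleet.instances * fleet.capacity)

        history.append(backlog)

    return history, restarts


# Same arrival rate and same initial backlog spike, two fleet sizes.
tight = Fleet(instances=4, capacity=10.0, overload_factor=3.0, restart_ticks=2)
roomy = Fleet(instances=8, capacity=10.0, overload_factor=3.0, restart_ticks=2)

for name, fleet in (("tight", tight), ("roomy", roomy)):
    history, restarts = simulate(fleet, arrivals=38.0, ticks=30,
                                 initial_backlog=100.0)
    print(f"{name}: restarts={restarts}, final backlog={history[-1]}")
```

With these made-up numbers, the four-instance fleet runs at 95% of its capacity in normal operation, so the initial spike pushes it over the health-check threshold; it then restarts repeatedly and its backlog only grows. The eight-instance fleet absorbs the same spike within a few ticks and never restarts, which mirrors the reasoning behind raising the scaling ceiling and running with more headroom.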