Close incident

Degraded Dialer Performance

Minor Resolved

Close experienced a minor incident on March 10, 2025 affecting Dialer, lasting 25m. The incident has been resolved; the full update timeline is below.

Started
Mar 10, 2025, 04:02 PM UTC
Resolved
Mar 10, 2025, 04:28 PM UTC
Duration
25m
Detected by Pingoru
Mar 10, 2025, 04:02 PM UTC

Affected components

Dialer

Update timeline

  1. investigating Mar 10, 2025, 04:02 PM UTC

    We've become aware of degraded performance of our Dialer service. We are investigating the issue. Updates will be posted as they become available.

  2. monitoring Mar 10, 2025, 04:21 PM UTC

    We are continuing to investigate the cause of the degraded performance of our Dialer system. Our Dialer system is now functioning normally. We are monitoring performance.

  3. resolved Mar 10, 2025, 04:28 PM UTC

    This incident has been resolved.

  4. postmortem Mar 12, 2025, 05:39 PM UTC

    Close sincerely apologizes for the interruption of our service. We take the stability of our platform very seriously. Below is an explanation of what happened and how we will prevent another such interruption from occurring.

    ## Impact

    Dialer functionality was impaired for 58 minutes, from 15:20 UTC to 16:18 UTC on March 10th, 2025. During this time, the Dialer feature could get stuck in the “connecting” state.

    ## Root Cause and Resolution

    The issue was triggered at 15:20 UTC by a service rebalance that caused a number of client connections to close simultaneously. When these clients attempted to reconnect, the resulting spike in traffic, which occurred under peak traffic conditions, exceeded system limits, leading to service disruptions.

    Our team quickly identified the cause and worked to stabilize the system. We restored normal operations by 16:18 UTC.

    To prevent similar incidents in the future, we are reviewing system thresholds and improving our ability to handle sudden increases in demand.

    ## Timeline

    * 15:20 UTC - a service rebalance occurs, starting a wave of new connections being established
    * 15:21 UTC - a portion of requests starts getting dropped due to rate limits
    * 15:28 UTC - alerts trigger and our response team begins identifying the root cause
    * 15:36 UTC - the rate of dropped requests subsides, but soon increases again due to a wave-like pattern of retries
    * 16:18 UTC - the final wave of increased errors finishes and the situation returns to normal operational levels
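The wave-like retry pattern in the timeline is typical of clients that all reconnect on the same schedule after a shared disconnect. The sketch below illustrates one common general mitigation, exponential backoff with full jitter, and compares the peak number of reconnects landing in any one second with and without jitter. It is not Close's implementation or their stated fix; the function names, client count, and the `base` and `cap` parameters are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def reconnect_times(num_clients, attempt, jitter, base=1.0, cap=60.0, seed=0):
    """Return the delay (in seconds) each client waits before a given retry.

    Hypothetical model: every client was disconnected at t=0 and is on the
    same retry attempt. Without jitter, all clients wait the same exponential
    delay. With full jitter, each client picks a uniform random delay between
    0 and that exponential ceiling.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    ceiling = min(cap, base * 2 ** attempt)
    if jitter:
        return [rng.uniform(0.0, ceiling) for _ in range(num_clients)]
    return [ceiling] * num_clients


def peak_per_second(times):
    """Largest number of reconnect attempts that fall in any one-second bucket."""
    buckets = Counter(int(t) for t in times)
    return max(buckets.values())


if __name__ == "__main__":
    synced = reconnect_times(1000, attempt=3, jitter=False)
    spread = reconnect_times(1000, attempt=3, jitter=True)
    print("peak reconnects/sec without jitter:", peak_per_second(synced))
    print("peak reconnects/sec with jitter:   ", peak_per_second(spread))
```

Without jitter, every client in this model retries in the same second, so the whole population hits the rate limit at once and is likely to be rejected together, which is one way a single spike can turn into repeated waves. With jitter, the same retries are spread across the backoff window, lowering the peak the service has to absorb.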