ClearlyIP incident

Clearly Cloud USA registration issues

ClearlyIP experienced a major incident on May 14, 2025, affecting Clearly Cloud USA - w and lasting 3h 8m. The incident has been resolved; the full update timeline is below.

Started: May 14, 2025, 03:52 PM UTC
Resolved: May 14, 2025, 07:00 PM UTC
Duration: 3h 8m
Detected by Pingoru: May 14, 2025, 03:52 PM UTC

Affected components

Clearly Cloud USA - w

Update timeline

investigating May 14, 2025, 03:52 PM UTC

We are seeing intermittent registration issues on our Clearly Cloud US servers.

monitoring May 14, 2025, 04:07 PM UTC

A fix has been implemented and we are monitoring the results.

identified May 14, 2025, 04:24 PM UTC

The issue has been identified and a fix is being implemented.

monitoring May 14, 2025, 04:48 PM UTC

A fix has been implemented and we are monitoring the results.

identified May 14, 2025, 04:55 PM UTC

The issue has been identified and a fix is being implemented.

identified May 14, 2025, 05:44 PM UTC

We are continuing to work on a fix for this issue.

monitoring May 14, 2025, 06:35 PM UTC

A fix has been implemented and we are monitoring the results.

resolved May 15, 2025, 03:21 PM UTC

This incident has been resolved.

postmortem May 15, 2025, 09:05 PM UTC

On the morning of Wednesday, May 14, we began receiving reports from Clearly Cloud users in parts of the US of intermittent issues with registrations and call completion. The engineering team immediately began investigating and closely monitored the performance of all related systems over the next several hours until the cause was identified. A full analysis confirmed that call signaling messages were delayed and backed up during three timeframes, lasting approximately 13 minutes, 6 minutes, and 5 minutes.

The cause was a replica (spare) database server that was unable to keep up with synchronization from the primary systems, which caused widespread delays and backlogs in completing tasks such as registration handling. Once the cause was traced to this non-production server, our team took immediate action by disabling its role and disconnecting it from the production systems while evaluating what was behind its performance problem. This prevented additional issues beyond those already experienced.

Metrics and logs indicated a likely hardware issue. Thankfully, ClearlyIP's recent investments in improving its Central US datacenter operations meant our team could quickly move the replica database to this newer environment. That migration was completed several hours after the issues began, and the replica server was restored to service in the late afternoon.

No similar Clearly Cloud issues have been observed since the replica server was taken out of service, including after it was restored to service. Our teams will continue to proactively monitor the performance of these systems and study additional improvements that can minimize the impact of similar circumstances in the future.