Cloudli incident

Cloudli Connect Inbound Calls

Cloudli experienced a major incident on October 21, 2025, affecting Inbound Calling and lasting 1h 37m. The incident has been resolved; the full update timeline is below.

Started: Oct 21, 2025, 01:45 PM UTC
Resolved: Oct 21, 2025, 03:22 PM UTC
Duration: 1h 37m
Detected by Pingoru: Oct 21, 2025, 01:45 PM UTC

Affected components

Inbound Calling

Update timeline

investigating Oct 21, 2025, 01:28 PM UTC

We are currently experiencing an issue affecting incoming calls to some customers in the USA, and we are investigating. Impacts: This could cause abnormal delays or an inability to initiate or receive calls.

monitoring Oct 21, 2025, 02:26 PM UTC

A resolution was implemented at 9:45 AM EDT and we are monitoring the results.

investigating Oct 21, 2025, 04:31 PM UTC

We are currently investigating some additional delayed calls being reported.

monitoring Oct 21, 2025, 04:57 PM UTC

Inbound and outbound calls have been confirmed working since 11:17 AM EDT. We will continue to monitor the situation.

resolved Oct 24, 2025, 05:21 AM UTC

This incident has been resolved.

postmortem Oct 24, 2025, 05:27 AM UTC

# Cloudli Incident Report

Services Affected: Clarity Inbound Calling, Clarity Outbound Calling

Start Date/Time: 9:00 AM (EDT), October 21, 2025

End Date/Time: 11:22 AM (EDT), October 21, 2025

## Summary of Events

At approximately 9:00 AM (EDT), Cloudli Support received multiple reports from customers indicating failed inbound and outbound calls. The Network Operations and Engineering teams were immediately engaged to investigate.

Initial analysis confirmed that the issue was not related to the prior night’s maintenance, which had already been rolled back. Further investigation revealed that the `shortlocation` entries in Redis for some Registrar Servers were incomplete, resulting in failed SIP registrations and call setup errors for a subset of customers.

At 9:13 AM, the GlobalSBC Registrar microservice was restarted to re-establish proper Redis synchronization. Service behavior normalized immediately following the restart, and call completion success rates returned to expected levels. The incident moved to a resolved-but-monitoring state at 9:45 AM.

By 10:15 AM, engineering confirmed that all SIP registrations were stable and that previously impacted customers were again able to complete calls successfully. Further validation through test calls, log reviews, and metric analysis confirmed full service restoration by 11:22 AM, closing the incident.

## Incident Analysis and Mitigation Measures

The incident on October 21, 2025 was caused by the AWS US-East regional outage (which occurred on October 20, 2025), which disrupted connectivity to Cloudli’s Kafka infrastructure. This interruption caused transient desynchronization between Kafka, Redis, and the Java microservices that manage some Registrar microservices, resulting in incomplete `shortlocation` entries and subsequent SIP registration failures.
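
As an illustration of the failure mode described above, the following is a minimal sketch of a check that flags incomplete location entries in a snapshot of registrar state. It is not Cloudli's implementation: the report does not describe the real `shortlocation` schema, so the key layout, the field names (`contact`, `registrar`, `expires`), and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping

# Hypothetical field names: the incident report does not document
# what a complete `shortlocation` entry actually contains.
REQUIRED_FIELDS = ("contact", "registrar", "expires")


def find_incomplete_entries(
    entries: Mapping[str, Mapping[str, str]],
    required_fields: Iterable[str] = REQUIRED_FIELDS,
) -> Dict[str, List[str]]:
    """Return {entry key: [missing field names]} for incomplete entries.

    A field counts as missing when it is absent or holds an empty value.
    """
    required = tuple(required_fields)
    problems: Dict[str, List[str]] = {}
    for key, fields in entries.items():
        missing = [name for name in required if not fields.get(name)]
        if missing:
            problems[key] = missing
    return problems


def find_missing_registrations(
    expected_keys: Iterable[str],
    entries: Mapping[str, Mapping[str, str]],
) -> List[str]:
    """Return expected keys that have no entry at all, sorted for stable output."""
    return sorted(set(expected_keys) - set(entries))


if __name__ == "__main__":
    # Illustrative snapshot; the addresses use a documentation-only IP range.
    snapshot = {
        "shortlocation:alice": {
            "contact": "sip:alice@198.51.100.10",
            "registrar": "reg-1",
            "expires": "3600",
        },
        "shortlocation:bob": {
            "contact": "sip:bob@198.51.100.11",
            "registrar": "",  # empty value, treated as missing
        },
    }
    print(find_incomplete_entries(snapshot))
    print(find_missing_registrations(
        ["shortlocation:alice", "shortlocation:bob", "shortlocation:carol"],
        snapshot,
    ))
```

In a live system the snapshot would have to be read from Redis, for example with one `HGETALL` per key if the entries are stored as hashes, which is itself an assumption. Keeping the check a pure function over plain dictionaries means it can be tested without a Redis instance.
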
Once AWS service availability returned for Cloudli’s Kafka infrastructure, restarting the affected Registrar component restored proper Redis state and normal call routing.

To prevent recurrence, Cloudli Engineering will:

* Implement enhanced caching and message queue redundancy beyond 12 hours, to reduce reliance on real-time cloud synchronization.
* Expand health monitoring around Registrar nodes to immediately flag Redis desynchronization events.

These measures will ensure platform resilience and reduce sensitivity to external cloud infrastructure interruptions.

## Final Remarks

At Cloudli, we take any interruption of service very seriously and are continuously evaluating new processes and mitigation measures that can be proactively implemented to ensure service continuity. When service interruptions do occur, our incident management procedure prioritizes prompt and clear notification, along with timely status and resolution updates, to our customers and partners.

We thank you for your continued support. Please feel free to reach out if you would like to discuss the particulars of this incident report further.