Dstny incident

ConnectMe + SMP - Disconnection Issues and Latency

Major · Resolved

Dstny experienced a major incident on December 16, 2025 affecting the EU region, lasting 1d 2h. The incident has been resolved; the full update timeline is below.

Started
Dec 16, 2025, 02:03 PM UTC
Resolved
Dec 17, 2025, 04:04 PM UTC
Duration
1d 2h
Detected by Pingoru
Dec 16, 2025, 02:03 PM UTC

Affected components

EU

Update timeline

  1. investigating Dec 16, 2025, 11:12 AM UTC

    We are currently investigating a potential incident affecting ConnectMe in the Nordics region. Impact: Suspected latency issues and intermittent disconnections from the server, resulting in no inbound or outbound calls. Our teams are actively working to identify the scope of the issue and will provide updates every 60 minutes as we gather more information. Thank you for your patience as we work to address this matter. Dstny Support

  2. investigating Dec 16, 2025, 12:05 PM UTC

    We are currently investigating an issue affecting ConnectMe and SMP in the Nordics region. Impact: Users are experiencing timeout issues, latency, and disconnections on both products. The problem appears to be linked to high CPU utilisation. Our engineering team is actively working to put mitigations in place to restore service. Updates will be provided every 60 minutes as we learn more. We apologise for any inconvenience caused and appreciate your patience during this time. Dstny Support

  3. identified Dec 16, 2025, 01:02 PM UTC

    We are still actively mitigating the issue affecting ConnectMe and SMP in the Nordics region. While our engineering team continues to work on a full resolution, we expect that customers should no longer be experiencing any impact. If this changes, we will update immediately. However, if you encounter any issues, whether intermittent or persistent, please raise them with Support urgently. Our next update will follow in one hour. Thank you for your continued patience. Dstny Support

  4. identified Dec 16, 2025, 01:03 PM UTC

    We are writing to inform you that an issue occurred today affecting ConnectMe and SMP in the Nordics region. The impact has been mitigated, and we are satisfied that customers should not experience any further disruption. We will continue to monitor the situation for the next 24 hours. If you encounter any issues, whether intermittent or persistent, please contact our Support team immediately. Thank you for your understanding and patience during this time. Dstny Support

  5. identified Dec 16, 2025, 02:07 PM UTC

    We are writing to inform you that an issue occurred today affecting ConnectMe and SMP in the Nordics region. The impact has been mitigated, and we are satisfied that customers should not experience any further disruption. We will continue to monitor the situation for the next 24 hours. If you encounter any issues, whether intermittent or persistent, please contact our Support team immediately. Thank you for your understanding and patience during this time. Dstny Support

  6. resolved Dec 17, 2025, 04:04 PM UTC

    We are writing to inform you that the issue affecting ConnectMe and SMP in the Nordics region yesterday has been mitigated, and we are satisfied that customers should not experience any further disruption. We have completed 24 hours of monitoring, and the incident is now considered resolved. We initially committed to providing a detailed post-mortem report within 5 business days. However, due to the complexity of this incident and the upcoming Christmas leave period, we require additional time to complete a comprehensive analysis. This will ensure the root cause is fully understood and that suitable preventative actions are identified to avoid recurrence. We will share the report as soon as it is finalised and appreciate your patience as we work to deliver thorough insights and preventative measures. We sincerely apologise for any inconvenience caused and thank you for your understanding and support throughout this incident. If you have any immediate questions or concerns, please reach out to our Support team. Dstny Support

  7. postmortem Jan 07, 2026, 11:23 AM UTC

    **Incident Summary**

    On 16th December 2025 between 10:35 and 12:56 UTC, users of ConnectMe and SMP services in the Nordics-STO production environment experienced widespread connection failures. A sudden surge in user disconnections and reconnections created an unexpected spike in system load, particularly on WebRTC components and network nodes. Rate-limiting protocols were initiated, but the volume overwhelmed the cluster and prevented it from self-recovering. As more users attempted to reconnect, CPU usage increased further, slowing key components and preventing successful registration to the ConnectMe backend. The incident was resolved at 12:56 UTC following scaling actions, configuration adjustments, and the introduction of additional rate-limiting measures.

    **Root Cause**

    The network experienced a high number of user disconnects and reconnection attempts within a very short period of time. Each reconnect triggered a resource-intensive validation step on the backend. Because the number of configured endpoints per WebRTC component was higher than previously tested, CPU usage increased rapidly across several nodes. Due to a limitation in the rate-limiting protocols, the system accepted more reconnection attempts than it could handle, compounding the load and slowing down essential components. This created a feedback loop in which users attempting to reconnect further increased CPU pressure, preventing the system from stabilising.

    **Incident Resolution**

    Service availability was immediately stabilised by redeploying WebRTC pods to clear outdated endpoint data, increasing CPU and memory allocations, and adding an extra service node to better distribute load. An emergency fix was also applied to remove a bottleneck and introduce stronger rate-limiting to control the surge of reconnection attempts, while additional WebRTC components were scaled to handle the increased traffic. These combined measures improved registration rates, reduced CPU pressure, and restored normal service by 12:56 UTC.

    **Mitigative Actions**

    * Strengthen WebRTC scalability and enforce effective rate-limiting across all layers (a hedged sketch follows this list).
    * Reduce the number of endpoints per WebRTC pod based on updated performance testing, and add proactive alerts for rapid endpoint growth.
    * Improve retry and rate-limiting behaviour within ConnectMe and related APIs (see the second sketch below).
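    The first sketch illustrates the admission-control side of the fix. The report does not publish Dstny's actual rate-limiting implementation, so this is only a minimal token-bucket gate in Go; the capacity and refill numbers, the `newTokenBucket` name, and the rejection message are hypothetical, not taken from the incident:

    ```go
    package main

    import (
    	"fmt"
    	"sync"
    	"time"
    )

    // tokenBucket admits at most `capacity` reconnect attempts in a burst and
    // refills at `refillRate` tokens per second. Attempts arriving when the
    // bucket is empty are rejected up front instead of queuing resource-heavy
    // validation work on an already loaded backend.
    type tokenBucket struct {
    	mu         sync.Mutex
    	capacity   float64
    	tokens     float64
    	refillRate float64 // tokens per second
    	lastRefill time.Time
    }

    func newTokenBucket(capacity, refillRate float64) *tokenBucket {
    	return &tokenBucket{
    		capacity:   capacity,
    		tokens:     capacity,
    		refillRate: refillRate,
    		lastRefill: time.Now(),
    	}
    }

    // allow reports whether one reconnect attempt may proceed right now.
    func (b *tokenBucket) allow() bool {
    	b.mu.Lock()
    	defer b.mu.Unlock()

    	now := time.Now()
    	b.tokens += now.Sub(b.lastRefill).Seconds() * b.refillRate
    	if b.tokens > b.capacity {
    		b.tokens = b.capacity
    	}
    	b.lastRefill = now

    	if b.tokens >= 1 {
    		b.tokens--
    		return true
    	}
    	return false
    }

    func main() {
    	// Hypothetical numbers: admit bursts of 200 reconnects, then 50/s sustained.
    	bucket := newTokenBucket(200, 50)

    	if bucket.allow() {
    		fmt.Println("reconnect admitted: run endpoint validation")
    	} else {
    		// Rejecting early keeps the expensive validation step off the CPU
    		// and gives the client a clear signal to back off and retry later.
    		fmt.Println("reconnect rejected: signal client to back off")
    	}
    }
    ```

    Rejecting surplus attempts at the edge is what breaks the feedback loop described in the root cause: the resource-intensive validation step only runs for admitted requests, so reconnect pressure can no longer compound CPU load.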
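    On the client side, the retry improvements named in the mitigative actions are commonly implemented as exponential backoff with full jitter, which spreads simultaneously disconnected clients across the retry window instead of letting them all reconnect at once. A minimal sketch, assuming a hypothetical `reconnect` callback; none of the names, timings, or attempt counts come from Dstny's report:

    ```go
    package main

    import (
    	"errors"
    	"fmt"
    	"math/rand"
    	"time"
    )

    // reconnectWithBackoff retries a failing reconnect with exponential backoff
    // and full jitter, so clients dropped at the same moment do not all retry
    // at the same moment and re-create the original load spike.
    func reconnectWithBackoff(reconnect func() error, maxAttempts int) error {
    	base := 500 * time.Millisecond // first retry window (hypothetical value)
    	maxWait := 30 * time.Second    // upper bound on any single wait

    	for attempt := 0; attempt < maxAttempts; attempt++ {
    		if err := reconnect(); err == nil {
    			return nil
    		}
    		// Exponential window: base * 2^attempt, capped at maxWait.
    		window := base << attempt
    		if window > maxWait {
    			window = maxWait
    		}
    		// Full jitter: sleep a uniform random duration in [0, window).
    		time.Sleep(time.Duration(rand.Int63n(int64(window))))
    	}
    	return errors.New("reconnect failed after max attempts")
    }

    func main() {
    	attempts := 0
    	// Hypothetical reconnect that succeeds on the third try.
    	err := reconnectWithBackoff(func() error {
    		attempts++
    		if attempts < 3 {
    			return fmt.Errorf("attempt %d: backend still overloaded", attempts)
    		}
    		return nil
    	}, 6)
    	fmt.Println("reconnected:", err == nil)
    }
    ```

    Combined with server-side admission control, this kind of jittered retry is the standard remedy for the reconnect storm pattern the postmortem describes; the exact mechanisms Dstny adopted have not been published.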