Dstny incident
ConnectMe - Nordic production - Users unable to login
Dstny experienced a critical incident on August 21, 2025 affecting EU, lasting 19h 38m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Aug 21, 2025, 02:20 PM UTC
We are currently investigating a potential incident affecting ConnectMe in the Nordics production region. Our teams are actively working to identify the scope of the issue and we will provide updates every 60 minutes as we gather more information. Thank you for your patience as we work to address this matter. Dstny Support
- monitoring Aug 21, 2025, 02:51 PM UTC
Our Engineering team has identified the root cause and implemented actions to restore ConnectMe services. Users will need to restart the ConnectMe application before attempting to log in. We continue to monitor service availability closely and will provide a further update within the next 2 hours.
- monitoring Aug 21, 2025, 03:32 PM UTC
Our Platform team has identified the root cause of the issue and implemented corrective measures to restore application services. We will continue to monitor service availability for the next 24 hours and do not anticipate any further impact at this time. Thank you. Dstny Support.
- resolved Aug 22, 2025, 09:59 AM UTC
We’re pleased to confirm that this incident has been fully resolved. Over the past 24 hours, we have closely monitored the platform and observed no recurrence or further impact. The root cause has been identified, and measures have been implemented to prevent a similar issue from occurring in future. To provide transparency and insight, a detailed post-incident report will be made available within the next five working days. We sincerely apologise for any inconvenience caused and appreciate your patience and understanding throughout this incident. If you have any further questions or concerns, please don’t hesitate to contact our support team. Kind regards, Dstny Support
- postmortem Oct 16, 2025, 09:34 AM UTC
**Major Incident Category:** Service Outage

**Post Mortem Owner:** Ant Hurlock

**Date Post Mortem Completed (UTC):** 28 Aug 2025, 13:45

**Incident Summary**

On 21 August 2025 at 13:56 UTC, a configuration error occurred during a routine maintenance activity involving the addition of a storage disk. This caused the storage platform to become overloaded, leading to system unresponsiveness and degraded performance. The result was a temporary outage for ConnectMe users in the Nordics region, while Analytics services experienced partial disruption that prevented data synchronisation. Recovery actions began at 14:30 UTC, including the removal of the newly added disk to stabilise the system. Full service was restored by 15:37 UTC, with no data loss reported.

**Root Cause**

The issue originated from an incorrect implementation during a standard infrastructure change to expand storage capacity. A recent update to the storage platform introduced undocumented changes to the disk integration process, requiring additional steps not present in previous versions. These changes were not known to the engineering team at the time. As a result, the disk addition triggered a network topology conflict between storage clusters. This caused unexpected system strain, leading to elevated CPU usage due to repeated timeouts and retries. The platform reached its operational limit, which in turn degraded performance and disrupted services.

This disruption impacted service availability, resulting in a loss of service for all ConnectMe users in the Nordics region. Analytics services experienced a partial disruption in which users were unable to synchronise new data, although existing data remained accessible and the application itself remained stable throughout the incident.

**Incident Resolution**

To resolve the issue, a recovery plan was agreed at 14:30 UTC to remove the newly added disks from the storage platform. This action enabled the system to automatically rebalance and gradually return to normal operating levels. As a result, service performance stabilised, and full functionality was restored by 15:37 UTC. No data loss occurred during the incident.

**Mitigative Actions**

To reduce the risk of recurrence, several actions are underway:

* Validation of new storage platform versions will be formally embedded into internal processes to ensure compatibility and stability before deployment.
* A focused review of the Nordics storage expansion will help identify contributing factors and improve planning for future changes.
* The process for adding storage disks has been reclassified from a standard change to a normal change, allowing for greater oversight and risk management during infrastructure maintenance involving capacity adjustments.

These steps are designed to strengthen platform resilience and ensure service continuity going forward.

### **Timeline**