SEDNA incident

Inability to access Sedna - customers on the Dublin3 Cluster

Major Resolved

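SEDNA experienced a major incident on April 23, 2024, affecting SEDNA - www.sednanetwork.com. The status page entry remained open for 5d 20h; the vendor's postmortem reports approximately 15 minutes of downtime. The incident has been resolved; the full update timeline is below.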

Started
Apr 23, 2024, 02:21 PM UTC
Resolved
Apr 29, 2024, 10:45 AM UTC
Duration
5d 20h
Detected by Pingoru
Apr 23, 2024, 02:21 PM UTC

Affected components

SEDNA - www.sednanetwork.com

Update timeline

  1. investigating Apr 23, 2024, 02:21 PM UTC

    We are aware of an issue affecting a number of customers on a specific infrastructure cluster (Dublin 3). Affected end users are unable to log into Sedna. We are investigating this as an emergency.

  2. monitoring Apr 23, 2024, 02:33 PM UTC

    An issue has been detected in the database layer for the customer tenants affected in Dublin 3. The team has resolved the immediate issue. We are continuing to investigate the root cause. Customers should now have access to Sedna.

  3. monitoring Apr 24, 2024, 09:40 AM UTC

    We are continuing to monitor for any further issues.

  4. monitoring Apr 24, 2024, 09:40 AM UTC

    We are continuing to monitor for any further issues.

  5. resolved Apr 29, 2024, 10:45 AM UTC

    This incident has been marked as resolved. For more details on this incident, see the linked Postmortem.

  6. postmortem Apr 29, 2024, 10:46 AM UTC

    Post Incident Report
    Dublin 3 Cluster Application Incident

    Date of Issue: 23 April 2024
    Incident Reference: INC-20240423-1418

    # 01 Summary

    On 23 April 2024, the Sedna Platform experienced a partial loss of service. A small number of customers were affected by an application outage and were therefore unable to send or receive emails for approximately 15 minutes. This is not the level of quality and reliability we strive to offer you. We have conducted an internal investigation and are taking immediate steps to improve the platform’s availability.

    # 02 Detailed description

    At approximately 14:08 UTC on 23 April 2024, one of Sedna's primary applications (Node) restarted. The restart was an automated action triggered by the system in response to an unhealthy state, and it resulted in 15 minutes of downtime while the restart completed.

    The restart was caused by a memory issue with the service, combined with an extraordinarily high workload on the system. The workload caused a backlog of requests, which eventually exhausted system memory and triggered the restart. The restart is an automated action that allows the system to respond to such an event and recover quickly; however, systems can be unavailable while it is in progress.

    The first customer case surfacing symptoms of an application outage was raised with Customer Support at 14:13 UTC, at which point the Sedna Support team triggered a major incident with the engineering team to urgently investigate the issue. The Sedna team posted a Status Page notification at 14:21 UTC, notifying all Status Page followers of the incident under investigation.

    The service was fully operational at 14:23 UTC, and all customers who reported an incident were informed of the incident closure on the same day.

    # 03 Remediation and Prevention

    An incident of this nature receives Sedna’s highest level of scrutiny to ensure we can provide our customers with full confidence in the system. Following the incident, the team conducted a retrospective to review the remediation taken and to detail next steps to prevent similar issues in the future. The Remediation and Prevention details are as follows:

    * Remediation:
        * As of this report, the issue itself has been fully resolved.
        * Engineering has conducted a full review of related code, logs, and telemetry data to reduce the likelihood of follow-on issues.

    * Prevention: We have put the following changes into effect to reduce the likelihood of the issue recurring:
        * Engineering has provisioned additional instances of the impacted application to help handle similar unexpected spikes in the future.
        * Engineering has increased the memory size of the Node Application to provide additional headroom for similar unexpected spikes in the future.

    # 04 What you can expect from SEDNA

    We understand the critical nature of the services SEDNA provides your business. We will continue to communicate with customers to answer any questions and ensure we do our best to provide a seamless customer experience. We apologize for any issues these events may have caused.

    Please reach out directly to SEDNA Support ([email protected]) with any questions.
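The timestamps in this record can be cross-checked with a short calculation. The sketch below is illustrative only: it uses the times quoted in the postmortem and in the status page header, and the helper names (`utc`, `minutes_between`, `days_hours`) are hypothetical, introduced here for the example. It shows why the page reports a duration of 5d 20h while the postmortem describes about 15 minutes of downtime: the first figure measures how long the status page entry stayed open, the second the restart window.

```python
from datetime import datetime, timezone


def utc(year, month, day, hour, minute):
    """Build a timezone-aware UTC timestamp (hypothetical helper)."""
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc)


def minutes_between(start, end):
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


def days_hours(delta):
    """Split a timedelta into (whole days, remaining whole hours)."""
    total_hours = int(delta.total_seconds() // 3600)
    return divmod(total_hours, 24)


# Times quoted in the postmortem's detailed description (all UTC).
restart = utc(2024, 4, 23, 14, 8)         # automated restart begins
first_case = utc(2024, 4, 23, 14, 13)     # first customer case raised
status_notice = utc(2024, 4, 23, 14, 21)  # Status Page notification posted
operational = utc(2024, 4, 23, 14, 23)    # service fully operational

# Times shown in the status page header (all UTC).
page_started = utc(2024, 4, 23, 14, 21)
page_resolved = utc(2024, 4, 29, 10, 45)

downtime_min = minutes_between(restart, operational)
to_first_case_min = minutes_between(restart, first_case)
to_notice_min = minutes_between(restart, status_notice)
page_open = days_hours(page_resolved - page_started)

print(f"Downtime per postmortem: {downtime_min} min")
print(f"Restart to first customer case: {to_first_case_min} min")
print(f"Restart to Status Page notification: {to_notice_min} min")
print(f"Status page entry open for: {page_open[0]}d {page_open[1]}h")
```

On these figures, the Status Page notification went out 13 minutes after the restart began, about two minutes before the postmortem says service was fully restored.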