Zonos incident

2023-06-13 Outage - Dashboard

Critical | Resolved

Zonos experienced a critical incident on June 13, 2023 affecting Dashboard, lasting 24m. The incident has been resolved; the full update timeline is below.

Started: Jun 13, 2023, 07:25 PM UTC
Resolved: Jun 13, 2023, 07:49 PM UTC
Duration: 24m
Detected by Pingoru: Jun 13, 2023, 07:25 PM UTC

Affected components

Dashboard

Update timeline

  1. investigating Jun 13, 2023, 07:25 PM UTC

    We are currently investigating reports of a potential service interruption with Dashboard. We apologize for any inconvenience and will post another update as soon as we learn more.

  2. investigating Jun 13, 2023, 07:27 PM UTC

    We are continuing to investigate this issue.

  3. identified Jun 13, 2023, 07:37 PM UTC

    An issue with upstream Lambda creation and execution has been identified, and we are waiting on a fix to be rolled out while investigating other mitigation strategies. For more information, see the AWS status at https://health.aws.amazon.com/health/status.

  4. monitoring Jun 13, 2023, 07:46 PM UTC

    A fix has been implemented and we are monitoring the results.

  5. resolved Jun 13, 2023, 07:49 PM UTC

    This incident has been resolved.

  6. postmortem Jun 13, 2023, 10:55 PM UTC

    **What products were affected and what was the impact?**

    Zonos Dashboard. Impact: CRITICAL

    **What timeframe did this issue occur?**

    | **Date** | **Time** |
    | --- | --- |
    | Jun 13, 2023 | 12:54 to 13:46 MDT |

    **How was the issue detected?**

    Internal reports of authorization failures and Dashboard becoming inaccessible.

    **What functionality was affected?**

    Zonos Dashboard was not accessible.

    **What problems did this cause?**

    Users were unable to access Dashboard to complete tasks.

    **What was the resolution of the problem and steps that are being taken for continued follow-up?**

    The issue was identified as an AWS Operational issue in the US-EAST-1 Region impacting an upstream service provider hosting our Front-End services for Dashboard. We were able to redeploy those services to an unaffected region to restore functionality.

    **What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

    We are continually assessing and improving business continuity solutions throughout every layer of our tech stack to minimize downtime and automate recovery where possible.