Dyspatch incident

Application and API unavailable

Major, Resolved

Dyspatch experienced a major incident on April 8, 2024 affecting API and Dashboard, lasting 13h 41m. The incident has been resolved; the full update timeline is below.

Started
Apr 08, 2024, 07:18 PM UTC
Resolved
Apr 09, 2024, 09:00 AM UTC
Duration
13h 41m
Detected by Pingoru
Apr 08, 2024, 07:18 PM UTC

Affected components

API, Dashboard

Update timeline

  1. investigating Apr 08, 2024, 07:18 PM UTC

    Some users may be experiencing availability issues on the Dyspatch application and API. The Dyspatch engineering team is investigating.

  2. identified Apr 08, 2024, 07:32 PM UTC

    The issue has been identified and a fix is being implemented.

  3. identified Apr 08, 2024, 08:10 PM UTC

    We are continuing to work on a fix for this issue.

  4. identified Apr 08, 2024, 08:42 PM UTC

    We are continuing to work on a fix for this issue.

  5. identified Apr 08, 2024, 09:15 PM UTC

    We are continuing to work on a fix for this issue.

  6. identified Apr 08, 2024, 09:57 PM UTC

    We are continuing to work on a fix for this issue.

  7. identified Apr 08, 2024, 10:35 PM UTC

    We are continuing to work on a fix for this issue.

  8. identified Apr 08, 2024, 11:26 PM UTC

    We are continuing to work on a fix for this issue.

  9. identified Apr 09, 2024, 12:18 AM UTC

    We are continuing to work on a fix for this issue.

  10. identified Apr 09, 2024, 01:09 AM UTC

    We are continuing to work on a fix for this issue.

  11. identified Apr 09, 2024, 02:05 AM UTC

    We are continuing to work on a fix for this issue.

  12. identified Apr 09, 2024, 03:06 AM UTC

    We are continuing to work on a fix for this issue.

  13. identified Apr 09, 2024, 04:08 AM UTC

    We are continuing to work on a fix for this issue.

  14. identified Apr 09, 2024, 05:15 AM UTC

    We are continuing to work on a fix for this issue.

  15. identified Apr 09, 2024, 06:21 AM UTC

    We are continuing to work on a fix for this issue.

  16. monitoring Apr 09, 2024, 07:58 AM UTC

    A fix has been implemented and we are monitoring the results. Thank you for your patience.

  17. resolved Apr 09, 2024, 09:00 AM UTC

    This incident has been resolved.

  18. postmortem Apr 16, 2024, 08:22 PM UTC

    # **Post Mortem** - **April 8 2024 Dyspatch Outage**

    ## **Intro**

    On April 8, 2024, Dyspatch was unavailable between the hours of 12:30PM and 01:00AM Pacific time due to an issue that occurred during a routine upgrade of Dyspatch's infrastructure. This post mortem aims to analyze the root causes of the outage, assess its impact on our services, and outline steps Dyspatch is taking to prevent similar incidents in the future.

    ## **Timeline (Pacific Time)**

    **11:35 -** We begin the upgrade.
    **12:10 -** The production cluster intermittently returns 503s for users. Dyspatch's services cannot communicate with each other.
    **12:17 -** We attempt to roll back the changes.
    **12:30 -** We identify the problem: the internal authentication mechanism our services use to communicate securely is out of sync across services.
    **12:30 - 17:30 -** We try several strategies to bring production online.
    **17:30 -** To avoid further impact to our production environment, work begins on our staging environment.
    **18:17 -** We identify that previous changes were made to our staging environment without being applied to our production environment.
    **21:16 -** Staging is online. We begin applying the changes from our staging environment to our production environment.
    **00:56 -** Dyspatch is available again.

    ## **Why did this happen? What did we learn?**

    During the outage we ran into several challenges trying to restore service. We discovered that a previous update to a critical component of our infrastructure was applied only to our staging environment. It was quickly determined that the issue was an authentication misalignment between Dyspatch's services, which meant that our services could not communicate with each other. We learned that we did not have a way to generate new credentials without taking the services that manage our cluster offline.

    After we determined that critical services had to be taken offline, we switched to testing on our staging environment to prevent data loss in our production environment. Ultimately, a difference between our production and staging environments had knock-on effects on our ability to roll back and recover quickly.

    ## **What are we doing about it?**

    There are several actions we intend to take to prevent similar issues from happening:

    1. We immediately aligned our staging and production environments to ensure that any infrastructure testing done in staging will behave the same when applied to our production environment. The root cause of this outage was a difference between environments, and this alignment ensures that we can be confident when testing required infrastructure changes.
    2. We plan to invest in tooling to help us automatically catch and audit any drift between our environments. Catching the difference beforehand would have prevented this incident.
    3. We are investing in tooling and processes to help us rebuild our cluster more reliably and quickly. We had to spend time migrating changes from our staging environment to our production environment when trying to restore Dyspatch.

    ## **Summary**

    Finally, we want to apologize. We know Dyspatch is important for supporting our customers' communications. Your patience and support mean a great deal to us and we appreciate everyone who reached out to our team. As with any operational issue, we will spend time in the coming days and weeks to understand the details of the event and make the improvements mentioned above to our infrastructure and processes.
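
The post mortem's second action item, automatically catching drift between staging and production, can be illustrated with a minimal Python sketch. This is not Dyspatch's tooling: the post mortem does not name the component that drifted or describe how the environments are configured, so the nested-dictionary snapshots, every key name and value, and the `flatten` and `find_drift` helpers are all hypothetical. The idea is only to compare two environment descriptions key by key and report anything that differs, while skipping settings that are expected to differ.

```python
from collections.abc import Iterable
from typing import Any

MISSING = "<missing>"  # placeholder for a key present in only one environment


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def find_drift(
    staging: dict[str, Any],
    production: dict[str, Any],
    ignore: Iterable[str] = (),
) -> dict[str, tuple[Any, Any]]:
    """Return {dotted_key: (staging_value, production_value)} for every difference."""
    ignored = set(ignore)
    flat_staging = flatten(staging)
    flat_production = flatten(production)
    drift: dict[str, tuple[Any, Any]] = {}
    for key in sorted(flat_staging.keys() | flat_production.keys()):
        if key in ignored:
            continue  # settings that are expected to differ between environments
        left = flat_staging.get(key, MISSING)
        right = flat_production.get(key, MISSING)
        if left != right:
            drift[key] = (left, right)
    return drift


# Hypothetical snapshots; every key and value here is invented for illustration.
staging = {
    "cluster": {"version": "1.2", "auth": {"cert_rotation": True}},
    "replicas": 2,
}
production = {
    "cluster": {"version": "1.1", "auth": {}},
    "replicas": 6,
}

drift = find_drift(staging, production, ignore={"replicas"})
for key, (staging_value, production_value) in drift.items():
    print(f"{key}: staging={staging_value!r} production={production_value!r}")
```

Run on a schedule or as a gate before an upgrade, a check like this would flag a change that reached staging but never production, which is the situation the post mortem describes. Real infrastructure state is rarely a plain dictionary, so an actual audit would need to read from whatever source of truth the environments use.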