Singlewire Software incident

Some Fusion servers reporting as disconnected

Singlewire Software experienced a notice-level incident on November 27, 2024, lasting 1h 27m. The incident has been resolved; the full update timeline is below.

Started: Nov 27, 2024, 08:03 PM UTC
Resolved: Nov 27, 2024, 09:30 PM UTC
Duration: 1h 27m
Detected by Pingoru: Nov 27, 2024, 08:03 PM UTC

Update timeline

  1. investigating Nov 27, 2024, 08:03 PM UTC

    We have received reports of Fusion servers reporting as disconnected. We are investigating to determine the cause of the disconnected status and what, if any, impact this might have on notification reliability.

  2. identified Nov 27, 2024, 08:18 PM UTC

    We have identified an issue with the service that reports on server state and are working to resolve it. Notification deliverability is not affected.

  3. monitoring Nov 27, 2024, 09:20 PM UTC

    After restarting the affected service, we observe that affected servers are now reporting as being connected. We will continue to monitor for any abnormalities.

  4. resolved Nov 27, 2024, 09:30 PM UTC

    After restarting the affected service, we observe that affected servers are now reporting as being connected. We will continue to work to identify the root cause and prevent similar future issues.

  5. postmortem Dec 20, 2024, 03:37 PM UTC

    The service that determines Fusion server health uses a distributed data store to track the state of each Fusion server, in a way that is meant to be resilient to the loss of a single node in our system. Unlike other services in which we use the same technology, this service was configured so that data had to match on every node before the system would process it. As part of a normal system maintenance operation on November 27, we replaced several nodes in our system, including the ones holding this Fusion server health data. Because of the misconfiguration, the system temporarily stopped processing Fusion server health data.

    To solve the immediate problem, we restarted the service, which returned it to a good working state. We are also planning a longer-term fix that corrects the configuration so that the future loss of a node cannot cause the same problem, either in this service or in other future services using the same technology.
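
The postmortem does not name the data store or its settings, so the following is only a generic sketch of the failure mode it describes: a policy that requires every node to agree tolerates no node loss, while a majority (quorum) policy tolerates some. The function names and the policy labels "all" and "quorum" are hypothetical and chosen for illustration; they are not taken from Singlewire's system.

```python
# Illustrative only: models how an agreement policy affects tolerance to
# node loss. The names and policy labels are assumptions, not the vendor's
# actual configuration.

def required_agreement(replicas: int, policy: str) -> int:
    """Return how many replicas must agree before data is processed."""
    if policy == "all":
        # Every replica must hold matching data.
        return replicas
    if policy == "quorum":
        # A strict majority of replicas must hold matching data.
        return replicas // 2 + 1
    raise ValueError(f"unknown policy: {policy}")


def can_process(replicas: int, available: int, policy: str) -> bool:
    """True if enough replicas are available to satisfy the policy."""
    return available >= required_agreement(replicas, policy)


def tolerated_losses(replicas: int, policy: str) -> int:
    """Number of replicas that can be lost while processing continues."""
    return replicas - required_agreement(replicas, policy)


if __name__ == "__main__":
    for policy in ("all", "quorum"):
        print(policy, "tolerates", tolerated_losses(3, policy), "lost node(s) of 3")
```

Under these assumptions, a three-node store with the "all" policy stops processing as soon as one node is replaced, which matches the behavior described in the postmortem, while the "quorum" policy keeps processing with one node missing.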