Lakeside Software incident

Intermittent Connectivity Issues

Major Resolved

Lakeside Software experienced a major incident on March 18, 2024 affecting SysTrack API/UI and SysTrack Endpoint Connections, lasting 6h 14m. The incident has been resolved; the full update timeline is below.

Started
Mar 18, 2024, 01:45 PM UTC
Resolved
Mar 18, 2024, 08:00 PM UTC
Duration
6h 14m
Detected by Pingoru
Mar 18, 2024, 01:45 PM UTC

Affected components

SysTrack API/UI
SysTrack Endpoint Connections

Update timeline

  1. investigating Mar 18, 2024, 02:00 PM UTC

    We are currently investigating this issue.

  2. identified Mar 18, 2024, 02:10 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Mar 18, 2024, 03:49 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. investigating Mar 18, 2024, 05:23 PM UTC

    We are currently investigating this issue.

  5. investigating Mar 18, 2024, 05:24 PM UTC

    We are continuing to investigate this issue.

  6. resolved Mar 18, 2024, 08:00 PM UTC

    A fix has been implemented and we are going to be actively monitoring the system. Due to the nature of the change, resolving to the cloud site might be delayed for some users depending on region and location.

  7. postmortem Apr 04, 2024, 01:58 AM UTC

    # What was the issue?

    All customers experienced intermittent problems loading the SysTrack CE for a period of time. Agents and users of the SysTrack web product were unable to connect for periods of time.

    # What was the root cause?

    This is a preliminary RCA and is subject to change once we receive the final RCA from Microsoft. Thus far, we have determined that the root cause lies with Microsoft Azure’s Application Gateway. This managed service, fully supported by Microsoft, appears to have received a backend update that Microsoft deployed to the different SysTrack regions. The updated version does not appear to handle our unique workload on the Application Gateway. The managed service became overloaded and went into an _unavailable_ state, so it stopped accepting inbound traffic.

    # What is the Prevention Strategy?

    We are still working with our cloud provider (Azure) to obtain a full RCA before we prepare an identification and prevention strategy.