StarRez incident

Service Disruption - Housing Director

Critical Resolved

StarRez experienced a critical incident on July 19, 2024 affecting The Housing Director Cloud Services, lasting 7h 31m. The incident has been resolved; the full update timeline is below.

Started
Jul 19, 2024, 05:44 AM UTC
Resolved
Jul 19, 2024, 01:15 PM UTC
Duration
7h 31m
Detected by Pingoru
Jul 19, 2024, 05:44 AM UTC

Affected components

The Housing Director Cloud Services

Update timeline

  1. investigating Jul 19, 2024, 05:44 AM UTC

    Housing Director customers in all regions are experiencing a service disruption when accessing Housing Director and related services. Engineers are actively reviewing this issue. Next update expected within 1 hour, or as warranted by a change of events.

  2. investigating Jul 19, 2024, 06:34 AM UTC

    StarRez can confirm that the THD platform is currently impacted by a larger global event related to the use of CrowdStrike. Engineers are actively reviewing the situation in an effort to restore service. Next update expected within 1 hour, or as warranted by a change of events.

  3. identified Jul 19, 2024, 07:28 AM UTC

    Engineers continue to review how to stabilize the platform and prevent further outages. Efforts are underway to mitigate this issue. Next update expected within 1 hour, or as warranted by a change of events.

  4. monitoring Jul 19, 2024, 08:30 AM UTC

    Services have been restored to all Housing Director customers. Engineers are still actively reviewing and monitoring this issue. Next update expected as warranted by a change of events.

  5. resolved Jul 19, 2024, 01:15 PM UTC

    This incident has been resolved.

  6. postmortem Jul 30, 2024, 11:01 PM UTC

    **StarRez Root Cause Analysis**

    A global event occurred in which our threat protection vendor distributed an update that caused blue screen events on all Windows-based hosts that received it.

    **Root Cause**

    At 5:05 AM UTC on July 19, 2024, the THD Cloud platform was impacted by an update distributed by our threat detection vendor. As a result, a subset of infrastructure experienced continual blue screen events and reboot loops, causing either complete failure or continual disruption of the underlying host.

    **Resolution**

    Multiple remediation practices were used to restore services after the vendor provided mitigation steps. A subset of systems recovered automatically once the vendor pushed an update; the remaining impacted hosts required manual intervention to remove the relevant update file so that the machine could boot successfully. At 7:58 AM UTC on July 19, 2024, all THD services were back online.

    **Next Steps**

    We will conduct post-incident reviews to investigate whether there are process improvements we can make for when vendors push updates that disrupt service.