StarRez incident

Service Disruption - Mercury

Critical Resolved

StarRez experienced a critical incident on July 19, 2024 affecting North America and EMEA, lasting 12h 1m. The incident has been resolved; the full update timeline is below.

Started
Jul 19, 2024, 05:45 AM UTC
Resolved
Jul 19, 2024, 05:47 PM UTC
Duration
12h 1m
Detected by Pingoru
Jul 19, 2024, 05:45 AM UTC

Affected components

North America, EMEA

Update timeline

  1. investigating Jul 19, 2024, 05:45 AM UTC

    Mercury customers in all regions are experiencing a service disruption when accessing Mercury and related services. Engineers are actively reviewing this issue. Next update expected within the next 3 hours, or as warranted by a change of events.

  2. investigating Jul 19, 2024, 06:35 AM UTC

    StarRez can confirm that the Mercury platform is currently impacted by a larger global event related to CrowdStrike. Engineers are actively reviewing the situation in an effort to restore service. Next update expected within the next 1 hour, or as warranted by a change of events.

  3. identified Jul 19, 2024, 07:28 AM UTC

    Engineers continue to review how to stabilize the platform and prevent further outages. Efforts are being made to mitigate this issue. Next update expected within the next 1 hour, or as warranted by a change of events.

  4. identified Jul 19, 2024, 08:33 AM UTC

    StarRez engineers are continuing to work on mitigations to bring stability and service back to the platform. Engineers are actively reviewing this issue. Next update expected within the next 1 hour, or as warranted by a change of events.

  5. identified Jul 19, 2024, 10:05 AM UTC

    StarRez engineers are continuing to work on mitigations to bring customers back online. Due to the situation, this is proceeding more slowly than anticipated. Engineers are actively working on this issue. Next update expected within the next 1 hour, or as warranted by a change of events.

  6. identified Jul 19, 2024, 11:12 AM UTC

    Services are starting to come online within the EMEA region. Engineers are actively working on this issue. Next update expected within the next 1 hour, or as warranted by a change of events.

  7. identified Jul 19, 2024, 11:46 AM UTC

    Services are starting to come online within the North America region. Engineers are actively working on this issue. Next update expected within the next 1 hour, or as warranted by a change of events.

  8. identified Jul 19, 2024, 12:52 PM UTC

    A subset of customers remains down in both the EMEA and North America regions. Further work is underway to bring these remaining customers back online. Engineers are actively working on this issue. Next update expected within the next 1 hour, or as warranted by a change of events. Thank you for your patience. StarRez

  9. identified Jul 19, 2024, 02:21 PM UTC

    Restoration efforts for the subset of customers down in North America and EMEA are still ongoing. Engineers are actively working on this issue. Next update expected within the next 1 hour, or as warranted by a change of events. Thank you for your patience. StarRez

  10. identified Jul 19, 2024, 02:44 PM UTC

    All services have been restored to EMEA customers. Restoration efforts continue for the subset of customers down in North America. Engineers are actively working on this issue. Next update expected within the next 1 hour, or as warranted by a change of events. Thank you for your patience. StarRez

  11. resolved Jul 19, 2024, 05:47 PM UTC

    Services have been restored to all Mercury customers.

  12. postmortem Jul 30, 2024, 10:58 PM UTC

    **StarRez Root Cause Analysis**

    A global event occurred in which an update distributed by our threat protection vendor caused blue screen events on all Windows-based hosts that received the update.

    **Root Cause**

    At 5:05 AM UTC on 19th July 2024, the Mercury Cloud platform was impacted by an update distributed by our threat detection vendor. This resulted in a subset of infrastructure experiencing continual blue screen events and reboot loops, causing either a complete failure or continual disruption for the underlying host.

    **Resolution**

    Multiple remediation practices took place to restore services after the vendor provided mitigation steps. A subset of systems recovered automatically once the vendor pushed an update; however, the remaining impacted hosts required manual intervention to remove the relevant update file so that the machine could boot successfully. At 3:37 PM UTC on 19th July 2024, all Mercury services were back online.

    **Next Steps**

    We will conduct post-incident reviews to investigate whether there are any process improvements we can make for when vendors push updates that disrupt service.