StarRez incident

Service Disruption - Central US

StarRez experienced a critical incident on July 18, 2024 affecting Central US, lasting 14h 27m. The incident has been resolved; the full update timeline is below.

Started: Jul 18, 2024, 10:48 PM UTC
Resolved: Jul 19, 2024, 01:15 PM UTC
Duration: 14h 27m
Detected by Pingoru: Jul 18, 2024, 10:48 PM UTC

Affected components

Central US

Update timeline

investigating Jul 18, 2024, 10:48 PM UTC

Customers in the Central US are currently experiencing a service disruption.
- Engineers are actively working to remediate the issue.
- Next update expected within 60 minutes, or as warranted by a change of events.

Apologies for any inconvenience,
StarRez Team

investigating Jul 18, 2024, 11:00 PM UTC

We have confirmed that the service disruption in Central US is region-wide and at the network layer.
- Engineers are actively working with our upstream provider to remediate the issue.
- Next update expected within 60 minutes, or as warranted by a change of events.

investigating Jul 18, 2024, 11:42 PM UTC

We have mitigated the issue for a subset of customers, who are now back online.
- We are currently verifying stability for these customers.
- For all remaining customers, StarRez engineers are actively working with our backend provider to remediate the issue.
- Next update expected within 60 minutes, or as warranted by a change of events.

identified Jul 19, 2024, 12:38 AM UTC

Disaster recovery is being initiated for the customers who remain offline.
- Next update expected within 60 minutes, or as warranted by a change of events.

identified Jul 19, 2024, 02:01 AM UTC

We're seeing instability again as the backend vendor applies mitigations in the region. Disaster recovery is underway for the impacted customers.
- Next update expected within 60 minutes, or as warranted by a change of events.

monitoring Jul 19, 2024, 02:45 AM UTC

All customer sites are back online. No disaster recovery was initiated. We will continue monitoring the situation.

resolved Jul 19, 2024, 01:15 PM UTC

This incident has been resolved.

postmortem Jul 30, 2024, 01:09 AM UTC

**StarRez Root Cause Analysis**

StarRez's upstream vendor experienced issues within its storage and compute infrastructure, resulting in backend database and compute services becoming inaccessible.

**Root Cause**

At 10:36 PM UTC on 18th July 2024, our upstream vendor experienced issues within storage clusters and compute resources in the Central US region. As a result, connectivity to a subset of backend database and compute infrastructure was lost.

**Resolution**

At 11:25 PM UTC on 18th July 2024, a subset of customer services was restored as StarRez engineers moved load to functional compute infrastructure. The rest of the customer base remained impacted due to connectivity issues with the backend database infrastructure.

At 2:40 AM UTC on 19th July 2024, our vendor restored backend database connectivity, and all remaining impacted customer services came back online.

**Next Steps**

A review of redundancy in the region will be carried out to determine whether any adjustments can be made to improve resilience.