StarRez incident
Service Disruption - Core Services - Asia Southeast
StarRez experienced a major incident on February 7, 2023 affecting Email Sending, lasting 2d 5h. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Feb 07, 2023, 08:47 PM UTC
Customers in Asia Southeast are experiencing a service disruption with some Core Services. - Engineers are actively working to remediate the issue. - Next update expected within 60 minutes, or as warranted by a change of events.
- identified Feb 07, 2023, 08:49 PM UTC
The upstream provider has confirmed that it is having issues in this region. - Engineers are actively working to remediate the issue. - Next update expected within 60 minutes, or as warranted by a change of events. Apologies for any inconvenience, StarRez Team
- identified Feb 07, 2023, 10:36 PM UTC
We have confirmed that this is a Microsoft outage in the datacenter in this region. We will provide updates as we receive them.
- identified Feb 07, 2023, 11:25 PM UTC
A workaround is currently being investigated in the hope of restoring services. - Engineers are actively working to remediate the issue. - Next update expected within 60 minutes, or as warranted by a change of events. Apologies for any inconvenience, StarRez Team
- identified Feb 08, 2023, 01:32 AM UTC
The workaround has brought a subset of sites back online. Work continues to restore service for the remaining sites. Our vendor still has no ETA for restoration of services within this region. - Engineers are actively working to remediate the issue. - Next update as warranted by a change of events. Apologies for any inconvenience, StarRez Team
- identified Feb 08, 2023, 05:07 AM UTC
There continues to be no ETA for restoration of services. Our vendor has advised that restoration work is still underway within the impacted region. - Engineers are actively working to remediate the issue. - Next update as warranted by a change of events. Apologies for any inconvenience, StarRez Team
- identified Feb 08, 2023, 06:05 AM UTC
There continues to be no ETA for restoration of services from our vendor. The customer dev/test environments impacted by this will remain offline for now. Engineers are actively reviewing whether the DR process should be engaged for these sites if the outage continues. - Engineers are actively working to remediate the issue. - Next update expected within 60 minutes, or as warranted by a change of events.
- identified Feb 08, 2023, 11:09 AM UTC
There continues to be no ETA for restoration of services from our vendor. Engineers are currently failing over the remaining customers to a functional region to bring services back online. - Engineers are actively working to remediate the issue. - Next update expected within 60 minutes, or as warranted by a change of events.
- identified Feb 08, 2023, 01:02 PM UTC
All production sites are now back online. The remaining Development sites are being worked on. There continues to be no ETA from our vendor for restoration of services within the Southeast Asia region. - Engineers are actively working to remediate the issue. - Next update as warranted by a change of events.
- monitoring Feb 08, 2023, 01:33 PM UTC
All customers are now back online after successfully failing over core resources. Engineers will continue to monitor this situation closely before closing out. - Next update as warranted by a change of events.
- resolved Feb 10, 2023, 02:20 AM UTC
The incident within this region has been resolved. StarRez will continue to monitor for stability to determine when it is safe to fail back resources into the region.
- postmortem Mar 03, 2023, 02:07 AM UTC
**Southeast Asia Outage – 7th February 2023**

A cooling failure within our upstream vendor's datacenter brought down a subset of services: storage accounts and SQL backends. This required StarRez to implement DR processes to bring customer sites back online.

**Root Cause**

A cooling failure within our upstream vendor's datacenter forced a shutdown of all storage and compute resources within this zone to protect our data. This impacted a subset of services for customers in this region, primarily storage accounts and SQL backends.

**Resolution**

A subset of services was re-provisioned in a functional zone within the region to bring a subset of customers back online. All remaining impacted services were failed over to the Asia East region as part of the StarRez DR process. Once the upstream vendor had brought all services back online within the region and StarRez was comfortable with stability, all services were moved back into the Southeast Asia region.

**Additional Information**

Following this incident, StarRez has reviewed and updated its DR process to ensure quicker recovery should similar incidents occur in the future.