Empist incident

Exchange Connectivity Issues

Empist experienced a minor incident on February 18, 2015, lasting 1d 4h. The incident has been resolved; the full update timeline is below.

Started: Feb 18, 2015, 02:37 PM UTC
Resolved: Feb 19, 2015, 07:31 PM UTC
Duration: 1d 4h
Detected by Pingoru: Feb 18, 2015, 02:37 PM UTC

Update timeline

investigating Feb 18, 2015, 02:37 PM UTC

We have received some reports that users are sporadically unable to connect to Exchange through Outlook. Our team is currently investigating the issue.
identified Feb 18, 2015, 03:13 PM UTC

We have identified the issue and are working to resolve it.
identified Feb 18, 2015, 03:40 PM UTC

The issue affects one server in the cluster, so not all users are experiencing problems. Our team is still working on restoring connectivity to the problem server.
monitoring Feb 18, 2015, 04:18 PM UTC

All of the databases have now been mounted. Outlook should reconnect in the next 5 to 10 minutes. It may become unresponsive for the first few minutes when you connect while all of the email is being updated. Our team will continue to monitor and update accordingly.
monitoring Feb 18, 2015, 05:13 PM UTC

We are once again receiving reports of connectivity issues with the same database. The team is working on it.
monitoring Feb 18, 2015, 06:37 PM UTC

We don't have an ETA just yet but the team is still working on the issue.
monitoring Feb 18, 2015, 07:53 PM UTC

The databases are mounted. We have not opened access yet because we are verifying the integrity of the databases to ensure no email is missing. We will provide another update soon.
monitoring Feb 18, 2015, 08:25 PM UTC

You can now access email through Outlook Web App using https://mail.myhostedservers.net. Use your email address as the username.
monitoring Feb 18, 2015, 10:16 PM UTC

We need to perform emergency maintenance on the affected server to address today's issue and to prevent it from recurring. This maintenance is not system-wide and will not affect all users. It will only affect the users who experienced problems today. We will begin this maintenance at 5:00 PM CT today to give us enough time to complete it before business hours tomorrow.
monitoring Feb 18, 2015, 11:20 PM UTC

The maintenance has started and another update will be provided mid-way through the process.
monitoring Feb 19, 2015, 04:13 AM UTC

The maintenance is progressing as expected. Another update will be provided as we continue to make progress. We expect to have the mailboxes online by 4:00 AM CST.
monitoring Feb 19, 2015, 09:14 AM UTC

Updated ETR is 5:30 AM CST. Another update will be provided when the maintenance is complete.
monitoring Feb 19, 2015, 10:18 AM UTC

The process reached 95% and then failed. Our team is currently working on it. An ETR is not available at this time.
monitoring Feb 19, 2015, 11:43 AM UTC

Our team can set up an email forward to a personal email account for you until access has been restored. If you would like a forward added, please contact our team at 312.445.2124 Option 4. We will continue to post updates to this page as they become available.
monitoring Feb 19, 2015, 03:10 PM UTC

The maintenance is complete and all mailboxes are online. If you called in, we will remove any email forwards that were set up. All email that was queued will also be delivered. No email was lost during this downtime. Our team will continue to monitor.
resolved Feb 19, 2015, 07:31 PM UTC

It has been over 4 hours since services were restored and no issues have been reported. This incident will now be closed. Outage details will be available by COB tomorrow. If you would like a detailed outage report, please email [email protected].
postmortem Aug 01, 2018, 07:36 PM UTC

Summary

On Wednesday, 2/18/15, at 8:32 am, we received alerts of connectivity issues to our Hosted Exchange platform. Our engineering team began investigating the problem immediately. We then identified that a server in our cluster had become non-responsive and stopped allowing connections to the mailboxes that resided on that server. Users experienced Outlook disconnects and subsequently were not receiving email. Because of the architecture of our Hosted Exchange platform, this was not a complete outage of all mailboxes; it affected only the mailboxes located on the problem server. During this outage, we queued all incoming email for the impacted user mailboxes so that no emails were lost.

Timeline

2/18/15
8:32 AM – Issues were identified through proactive monitoring.
8:37 AM – Announcement of outage.
9:13 AM – Identified root cause. Team was working on this issue.
9:40 AM – We provided updates.
10:18 AM – We believed the issue was resolved, but it persisted and we continued working on it.
11:13 AM – Provided update.
12:37 PM – Provided update.
1:53 PM – Provided update. Databases were mounted and we were verifying the integrity of the databases.
2:25 PM – Opened access to email via Outlook Web Access. However, further monitoring confirmed that connectivity was still sporadic.
4:15 PM – We announced an emergency maintenance to rectify the problem, scheduled for 5 pm CT.
5:20 PM – Update was provided that maintenance had begun and the mailboxes on that server were completely unavailable.
10:13 PM – Update in which we provided an ETA of 4 am CT for completion of maintenance.

2/19/15
3:14 AM – Update that ETR was pushed to 5:30 CT.
4:18 AM – Maintenance process failed.
5:43 AM – We offered email forwarding as an ETR was not available.
9:10 AM – All services were restored. Team continued to monitor.
1:31 PM – We closed the incident as no further issues were detected or reported.

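The update timeline on this page is stamped in UTC, while the postmortem timeline uses local Central Time. As a rough cross-check, the sketch below converts two of the page's UTC stamps to Central Standard Time and computes the incident duration. It assumes a fixed UTC-6 offset (the updates refer to CST, and February falls outside daylight saving time); the helper names `utc_to_cst` and `postmortem_style` are hypothetical and are not part of any Empist tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CT" in February 2015 means Central Standard Time, a fixed UTC-6 offset.
CST = timezone(timedelta(hours=-6), "CST")


def utc_to_cst(stamp: str) -> datetime:
    """Parse a status-page stamp such as 'Feb 18, 2015, 02:37 PM UTC' and shift it to CST."""
    parsed = datetime.strptime(stamp, "%b %d, %Y, %I:%M %p UTC")
    return parsed.replace(tzinfo=timezone.utc).astimezone(CST)


def postmortem_style(moment: datetime) -> str:
    """Format a datetime the way the postmortem timeline writes it, e.g. '2/18/15 8:37 AM'."""
    hour = moment.hour % 12 or 12
    suffix = "AM" if moment.hour < 12 else "PM"
    return f"{moment.month}/{moment.day}/{moment:%y} {hour}:{moment.minute:02d} {suffix}"


started = utc_to_cst("Feb 18, 2015, 02:37 PM UTC")
resolved = utc_to_cst("Feb 19, 2015, 07:31 PM UTC")

print(postmortem_style(started))   # 2/18/15 8:37 AM
print(postmortem_style(resolved))  # 2/19/15 1:31 PM
print(resolved - started)          # 1 day, 4:54:00
```

Under that assumption, the first status update lines up with the 8:37 AM outage announcement, the closing update with the 1:31 PM entry, and the elapsed time agrees with the reported duration of 1d 4h.
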
Root Cause

The issue was the result of a “bug” in the backup software. Under normal operating conditions, the software creates a single snapshot and then removes it after a successful backup operation. Because of the “bug”, instead of creating 1 snapshot it entered a loop that created a total of 75 snapshots and did not properly remove them. Even 2 snapshots on a server can degrade its performance. With 75 snapshots, server performance deteriorated until the server was no longer operational.

Resolution / Remediation

We reviewed all possible scenarios to restore mailbox access as quickly as possible while maintaining the integrity of the data. Each scenario involved extensive downtime. Failing over to another server was not an option, not because of insufficient resources but because this was an application issue. Backups were available and restoring from them was possible, but to maintain the integrity of the data and prevent email loss, our team determined that the best course of action was to work through an extensive recovery.

Preventative Measures

To prevent an outage like this from occurring in the future, we will be taking the following actions:

1. Apply a patch to the backup software to fix the “bug”.
2. Complete a review of the backup architecture.
3. Configure additional servers in the cluster to spread mailboxes across more servers.
4. Deploy additional database availability groups (DAG) to provide additional application redundancy.

We understand that email is the central form of communication for all businesses and that this outage presented challenges to all staff members who were impacted. This was an anomaly that would not occur under normal circumstances. Please accept our sincerest apologies for any negative impact on your businesses, and please know that we will continue to improve and audit our infrastructure to ensure that outages like this do not occur again.