SYNAQ incident

Cloud Mail Incident - 28/04/2021

SYNAQ experienced a major incident on April 28, 2021 affecting SYNAQ Cloud Mail, lasting 1h 18m. The incident has been resolved; the full update timeline is below.

Started: Apr 28, 2021, 11:38 AM UTC
Resolved: Apr 28, 2021, 12:56 PM UTC
Duration: 1h 18m
Detected by Pingoru: Apr 28, 2021, 11:38 AM UTC

Affected components

SYNAQ Cloud Mail

Update timeline

investigating Apr 28, 2021, 11:38 AM UTC

Dear Clients, SYNAQ Cloud Mail is currently experiencing an incident were certain users are struggling to authenticate and download mail. Engineers are investigating this as a matter of urgency. We will send our next update in 30 minutes.
investigating Apr 28, 2021, 11:47 AM UTC

Dear Clients, SYNAQ Cloud Mail is currently experiencing an incident were certain users are struggling to authenticate and download mail. Engineers are investigating this as a matter of urgency. We will send our next update in 30 minutes.
resolved Apr 28, 2021, 12:56 PM UTC

Dear Clients, The SYNAQ Cloud Mail incident has been resolved and the service has returned to optimal functionality.
postmortem May 11, 2021, 11:32 AM UTC

Summary and Impact to Customers On Wednesday 28th April 2021 from 13:38 to 14:56, SYNAQ Cloud Mail experienced a mail authentication incident. The resultant impact of the event was that users were unable to authenticate could not access the platform. Root Cause and Solution The root cause of this event was due to a failed route on our primary “High Availability” server and the Secondary server not taking over. With the route being down, the authentication requests were unable to reach the Cloud Mail proxy servers and authentication was unable to complete. To resolve this issue, the primary server had to be shutdown to bring the routes back up again. Once the routes had been restored to the Cloud Mail proxy server, authentication requests were once again being processed and users could access their mail. Upon further investigation, it was determined that the Secondary server had a configuration error which prevented it from being able to detect there was a problem with the primary server, which would have allowed it to take over. This configuration has been corrected and failover has since been tested and this moved across seamlessly. Remediation Actions • A full audit of all HA servers to ensure all their rules are identical. • Scheduled failover testing of HA servers to be implemented.