TOPdesk incident

RESOLVED: UK1 Performance Issues 18/06

TOPdesk experienced a critical incident on June 18, 2024 affecting the UK1 SaaS hosting location, lasting 1h 22m. The incident has been resolved; the full update timeline is below.

Started: Jun 18, 2024, 09:27 AM UTC
Resolved: Jun 18, 2024, 10:50 AM UTC
Duration: 1h 22m
Detected by Pingoru: Jun 18, 2024, 09:27 AM UTC

Affected components

UK1 SaaS hosting location

Update timeline

investigating Jun 18, 2024, 09:27 AM UTC

We are currently experiencing problems at our UK1 hosting location. As a result, outgoing mails may be impacted and some users could also experience performance issues. We are aware of the problem and are working on a solution. Our apologies for the inconvenience. At the time of writing, we are not able to give you an estimate of when your environment will be available. The current status can be found on our TOPdesk Status Page: https://status.topdesk.com To inform TOPdesk that you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ and refer to incident TDR24 06 5497.

monitoring Jun 18, 2024, 10:25 AM UTC

We are pleased to report that the performance issues and disruptions to outgoing mails at our UK1 hosting location have been addressed. As of now, our customers are not experiencing any performance issues or disruptions with outgoing email services. All functionality has returned to normal operating conditions. We will continue to monitor the situation closely for an extended period to ensure stability. The next update will follow in 15 minutes.

resolved Jun 18, 2024, 10:50 AM UTC

We are pleased to inform you that all functionality at our UK1 hosting location has remained stable, and we now consider this issue resolved. To improve our system, a thorough investigation into the root cause of this incident will follow. The findings and any steps taken to prevent a recurrence will be shared in an upcoming Root Cause Analysis (RCA) report on our status page. We apologise for any inconvenience this incident may have caused.

postmortem Aug 07, 2024, 12:30 PM UTC

The major incident was most likely caused by the Audittrail Service overloading the server, which also hosted the Gatekeeper Service responsible for authorization. The increased load from the Audittrail Service led to significant performance issues, including service logging errors, slow response times, and login failures. The root cause is believed to be the shared resource constraint between the Audittrail Service and the Gatekeeper Service running on the same machine. The incident was most likely resolved when the load on the server balanced itself. To prevent future occurrences, the plan is to separate the databases using Azure infrastructure; this work is ongoing and has high priority.