House of Control incident
[Complete Control] Trouble logging in for our Danish customers
House of Control experienced a critical incident on April 25, 2022 affecting Complete Control, Region - Denmark, lasting 1d 1h. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Apr 25, 2022, 11:11 AM UTC
Some users may be experiencing trouble when logging in to Complete Control Denmark. Our operations team is currently investigating issues related to login. We will send an additional update in 15 minutes.
- identified Apr 25, 2022, 11:25 AM UTC
We have identified the issue, and Danish customers should be able to log in again. We will continue to investigate.
- monitoring Apr 25, 2022, 01:00 PM UTC
The service seems to have stabilized, and Danish customers should be able to log in as usual. We will continue to monitor the service.
- resolved Apr 26, 2022, 12:22 PM UTC
We have been monitoring the service for 24 hours with no new issues. The root cause of the incident was a utility service causing resource exhaustion. We have deployed a fix for the service, which reduces the chance of similar events in the future, and added new monitoring that lets our operations team catch the condition before any limit is reached.
- postmortem May 02, 2022, 10:12 AM UTC
## Timeline

* 2022-04-25 12:28 Last successful requests. After this point, customers were not able to log in.
* 2022-04-25 12:59 A customer experiences login issues and contacts support. Customer Success creates an internal incident ticket in the incident management system.
* 2022-04-25 13:01 The Operations team and developers start working on the case.
* 2022-04-25 13:11 The Operations team opens an incident on [statuspage.io](http://statuspage.io).
* 2022-04-25 13:12 The Operations team restarts the service and resources are freed up. From this point, customers are able to log in.
* 2022-04-25 13:25 The statuspage.io incident is updated with status Identified.
* 2022-04-25 13:30 The internal incident ticket is updated to review, and status information is shared.
* 2022-04-25 13:30 The Operations team is monitoring the processes.
* 2022-04-25 14:38 The affected service is restarted.
* 2022-04-25 14:40 The affected service is running again.
* 2022-04-25 15:00 The statuspage.io incident is updated with status Monitoring.
* 2022-04-26 14:22 The statuspage.io incident is updated with status Resolved.

### Cause

* The process limit was reached, so no new processes could be started. This was caused by a bug in a utility service.

### Resolution

* A fix in Complete Control makes requests time out, so they can no longer hold processes for longer than a fixed time.
* Monitoring of processes will allow operations to catch any irregularities before resources are exhausted.

### Measures

* We will improve monitoring on selected endpoints to make sure they always respond with the correct status code.
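
The two resolution items (a fixed upper bound on how long a request may hold a process, and an alert before the process limit is reached) can be illustrated with a minimal Python sketch. This is not the actual Complete Control fix, which the report does not describe in detail. The limit, alert threshold, timeout value, and function names below are all hypothetical, and the sketch assumes each request is served by a helper process.

```python
import subprocess
import sys

# All values below are hypothetical and chosen only for illustration.
PROCESS_LIMIT = 100       # assumed hard limit on concurrent processes
ALERT_PERCENT = 80        # assumed alert threshold, as a percentage of the limit
REQUEST_TIMEOUT_S = 2.0   # assumed fixed upper bound per request, in seconds


def run_with_timeout(cmd, timeout_s=REQUEST_TIMEOUT_S):
    """Run a helper process, but never let it hold a slot longer than timeout_s.

    Returns the exit code, or None if the process was killed on timeout.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so the process slot has been released by the time we get here.
        return None


def should_alert(active_processes, limit=PROCESS_LIMIT, percent=ALERT_PERCENT):
    """Return True once usage reaches the alert threshold, before the hard limit."""
    return active_processes * 100 >= limit * percent


if __name__ == "__main__":
    # A helper that sleeps far longer than the timeout is terminated early.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print("slow helper result:", run_with_timeout(slow, timeout_s=0.5))
    print("alert at 85 processes:", should_alert(85))
```

The point of the design is that a stuck request fails fast instead of accumulating, while the threshold check gives operations time to react before the hard limit blocks new logins.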