HelloID incident

Slowness on West Europe cluster

HelloID experienced a minor incident on June 11, 2024 affecting Service Automation and Access Management, lasting 15h 23m. The incident has been resolved; the full update timeline is below.

Started: Jun 11, 2024, 08:38 PM UTC
Resolved: Jun 12, 2024, 12:01 PM UTC
Duration: 15h 23m
Detected by Pingoru: Jun 11, 2024, 08:38 PM UTC

Affected components

Service AutomationAccess Management

Update timeline

investigating Jun 11, 2024, 08:38 PM UTC

The cache on the West Europe cluster is not automatically being restored in a fashionable manner. This can cause slowness while using HelloID. When an error occurs, please retry, as it's likely to fix the error.
monitoring Jun 12, 2024, 12:05 AM UTC

A fix has been implemented, and we are monitoring the results. Be aware! Automation task incidents using the incident bus are not registered and sent.
identified Jun 12, 2024, 08:32 AM UTC

We've come to the conclusion the Service Automation tasks are not processed as expected. We have identified the cause and are working on a solution.
monitoring Jun 12, 2024, 09:51 AM UTC

The upscaling has handled the extra temporary load. The logging shows the situation is normalizing and pending tasks are executing. We keep monitoring the environments.
resolved Jun 12, 2024, 12:01 PM UTC

This incident has been resolved.
postmortem Jun 12, 2024, 01:44 PM UTC

**What happened?** Since the release of 2024.06 on 10th of June 2024 the process of writing status updates from Automation tasks to HelloID failed intermittently. We've been alerted by our monitoring system and started investigation on 11-6-2024. Our initial analysis showed that the API endpoint responsible for handling Automation Task progress \(progress of the task but also the logging coming from the HelloID Directory Agent\) had some issues, but we couldn't reproduce it straight away. ‌ Around 19.00 UTC\+2 we were alerted that some tenants had a full SQL database, and we had to mitigate that and quickly discovered that this was also the case on West US and could link this to the problem regarding the Automation Task progress handling. ‌ **Mitigation steps** We identified the issue in the code and decided to roll back that change that caused the issue, after testing this on our test and preview environment we've shipped it to West US and after some checks we've deployed it to West Europe. West US showed recovery on that API endpoint and all seemed good. West Europe initially also showed recovery but a short time after that some health checks started failing. This was due to the backlog of Automation Task status handling that the agents had in their backlog, and they started informing HelloID of past Automation Task events that still needed to be reported. This caused a mass spike in calls to another system of ours, which got flooded and wasn't able to keep up. We needed to disable these calls on the West Europe deployment for now to get all systems healthy again, we did so around 02.00 UTC\+2 on 12th of June 2024. After this mitigation the systems came back up and all health checks showed healthy status and our testing also showed that systems were working as expected once again. ‌ Early in the morning a queue in the Service Automation system began to fill up with a lot of messages, and we've received alerts from our monitoring system, and we directly started investigating. The system was working, but still needed to process a lot of the backlog of the last night. We've mitigated this issue by scaling up and resolved these issues at 11.45 UTC\+2. ‌ **Future steps** We're planning on taking the following action to make this incident less likely to occur in the future: * Improve the Automation Task status update processing * Improve the communication between HelloID and our internal incidents system * Improve SA alerting regarding the queue stats