Kalix EMR incident

Degraded Performance

Kalix EMR experienced a minor incident on March 11, 2021 affecting Kalix Platform, Telehealth, and 1 more component, lasting 9h 46m. The incident has been resolved; the full update timeline is below.

Started: Mar 11, 2021, 07:04 PM UTC
Resolved: Mar 12, 2021, 04:50 AM UTC
Duration: 9h 46m
Detected by Pingoru: Mar 11, 2021, 07:04 PM UTC

Affected components

Kalix Platform
Telehealth
Online Schedulers, Messaging, Notifications

Update timeline

investigating Mar 11, 2021, 07:04 PM UTC

Kalix has been experiencing intermittent periods of instability. We are currently investigating the underlying cause and will provide ongoing updates.

investigating Mar 11, 2021, 07:20 PM UTC

We have had 3 sudden timeout issues with our database this morning, one lasting 15 minutes and two more lasting 5 minutes each, separated by about 30 minutes to 1 hour. We are attempting to work out the underlying cause of the timeouts. We are also going to try to reduce the load on the database by removing some secondary queries. This will not change any functionality for Kalix users.

investigating Mar 11, 2021, 07:44 PM UTC

We have just released a possible fix for this issue. We are monitoring Kalix to see whether the update solves the problem.

monitoring Mar 11, 2021, 07:46 PM UTC

A fix has been implemented and we are monitoring the results.

investigating Mar 11, 2021, 07:58 PM UTC

Unfortunately, the cause of the problem was not what we thought it was. Kalix was degraded again when the server became unresponsive. We are currently investigating another possible cause.

investigating Mar 11, 2021, 10:20 PM UTC

We haven't seen another spike in timeouts for a number of hours now, but we will continue to monitor. Unfortunately, we have still not been able to determine the root cause of the problem. The investigation continues.

identified Mar 12, 2021, 03:51 AM UTC

Our engineers have found a possible cause for this issue, and we are currently working on a permanent fix to prevent it from happening again. Unfortunately, testing this caused another overload of the server, with 5-10 minutes of downtime.

identified Mar 12, 2021, 03:59 AM UTC

Servers have recovered from the latest overload. We are still identifying the exact problem, but we are confident we should have this fixed soon.

resolved Mar 12, 2021, 04:50 AM UTC

The issue has been resolved, and an emergency patch has been deployed. The cause was an infinite loop in the slot generation code for the online scheduler. Under certain conditions, this code looped indefinitely and used up all the memory on the server. The server then became unresponsive for a time before eventually crashing and automatically restarting, which is why the downtimes were relatively short-lived. The fix has been confirmed, and safeguards are now in place to prevent too many slots from being generated for a given day on the online scheduler.
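
For illustration only, here is a minimal sketch of the kind of safeguard described above. It is not the Kalix code: the function name `generate_slots`, the cap `MAX_SLOTS_PER_DAY`, and the exception `SlotLimitExceeded` are all hypothetical, and the zero-length slot is just one assumed way such a loop could fail to terminate. The actual triggering conditions were not published.

```python
from datetime import datetime, timedelta

# Hypothetical cap: at most one slot every 5 minutes over 24 hours.
MAX_SLOTS_PER_DAY = 288


class SlotLimitExceeded(Exception):
    """Raised when a day would produce more slots than the cap allows."""


def generate_slots(start, end, slot_length, max_slots=MAX_SLOTS_PER_DAY):
    """Return (slot_start, slot_end) pairs that fit between start and end."""
    # A zero or negative slot length never advances the cursor, so the
    # loop below would run forever and the list would grow without bound.
    if slot_length <= timedelta(0):
        raise ValueError("slot_length must be positive")

    slots = []
    current = start
    while current + slot_length <= end:
        # Hard cap: stop before memory use can get out of hand.
        if len(slots) >= max_slots:
            raise SlotLimitExceeded(
                f"more than {max_slots} slots requested for one day"
            )
        slots.append((current, current + slot_length))
        current += slot_length
    return slots


if __name__ == "__main__":
    day_start = datetime(2021, 3, 11, 9, 0)
    day_end = datetime(2021, 3, 11, 17, 0)
    slots = generate_slots(day_start, day_end, timedelta(minutes=30))
    print(len(slots))
```

The two checks are deliberately independent: the input validation rejects one known non-terminating case up front, while the cap bounds memory use even for causes nobody anticipated.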