Kalix EMR incident

API Down

Kalix EMR experienced a critical incident on March 22, 2021 affecting Kalix Platform, Telehealth, and other components, lasting 8d 5h. The incident has been resolved; the full update timeline is below.

Started: Mar 22, 2021, 07:22 PM UTC
Resolved: Mar 31, 2021, 12:44 AM UTC
Duration: 8d 5h
Detected by Pingoru: Mar 22, 2021, 07:22 PM UTC

Affected components

Kalix Platform, Telehealth, Online Schedulers, Messaging, Notifications

Update timeline

investigating Mar 22, 2021, 07:22 PM UTC

The Kalix API is currently down, so Kalix may fail to load or may not allow users to log in. We are investigating the issue and hope to have a solution as soon as possible.
investigating Mar 22, 2021, 07:23 PM UTC

We are continuing to investigate this issue.
monitoring Mar 22, 2021, 07:31 PM UTC

The API is back up, along with Kalix. We are still determining the underlying cause of this downtime. It appears that the instances serving requests became unresponsive, then recovered and came back up automatically after some time.
monitoring Mar 22, 2021, 07:57 PM UTC

We are continuing to monitor for any further issues.
investigating Mar 22, 2021, 09:22 PM UTC

We are currently investigating this issue.
investigating Mar 22, 2021, 09:23 PM UTC

Unfortunately, the API outage has recurred. We are investigating this problem and will keep you updated.
monitoring Mar 22, 2021, 09:41 PM UTC

A fix has been implemented and we are monitoring the results.
monitoring Mar 22, 2021, 09:45 PM UTC

We have found a possible reason for the downtime and are working on a fix. The issue seems to be related to conservative probes on the servers, which cause them to shut down too easily. We have just deployed a change that should fix this issue, and we will be testing some further changes.
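To illustrate how a probe that is too strict can restart servers too readily, here is a minimal Python sketch. It is not Kalix's actual configuration: the `LivenessProbe` class, the `failure_threshold` parameter, and the sample check results are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class LivenessProbe:
    """Hypothetical probe: restarts an instance after N consecutive failed checks."""
    failure_threshold: int
    consecutive_failures: int = 0

    def observe(self, check_passed: bool) -> bool:
        """Record one check result; return True if the instance should be restarted."""
        if check_passed:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def count_restarts(results, failure_threshold):
    """Count how many restarts a sequence of check results would trigger."""
    probe = LivenessProbe(failure_threshold)
    restarts = 0
    for ok in results:
        if probe.observe(ok):
            restarts += 1
            probe.consecutive_failures = 0  # a restarted instance starts fresh
    return restarts


# Made-up data: a brief slowdown (two failed checks), recovery, then one blip.
results = [True, False, False, True, True, False, True]
strict = count_restarts(results, failure_threshold=1)    # restarts on any failure
tolerant = count_restarts(results, failure_threshold=3)  # rides out short slowdowns
```

With these made-up results, the strict probe restarts the instance on every failed check, while the more tolerant one never restarts it, because no run of failures reaches the threshold.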
identified Mar 23, 2021, 08:33 PM UTC

One of the servers went down and was unresponsive for approximately 2 minutes before being restarted. It came back online, became unresponsive again for a further 2 minutes, and was restarted once more. Some requests during this time would have timed out. The server was then removed to prevent more restarts. We also have a better idea of what may have caused the server to become unresponsive.
monitoring Mar 24, 2021, 12:23 PM UTC

Overnight we deployed some stronger servers and fine-tuned our deployment. We also identified a number of possible bugs that might have caused memory leaks, so we are more confident that we shouldn't have any issues today. We are monitoring the servers as the day goes on.
monitoring Mar 25, 2021, 02:41 PM UTC

Unfortunately, we had a sick server this morning that did not recover even after a restart. This caused some connection issues in Kalix, as some requests were hitting the sick server. We deployed a fix so that when a sick server is detected, requests are not directed to it until it has recovered. This allowed the server to recover, and it should also prevent problems like this in the future. The downtime today lasted a total of 40 minutes.
monitoring Mar 25, 2021, 03:19 PM UTC

We have deployed a fix so that sick instances will not receive requests. This should prevent downtimes of Kalix whenever any single server has any issues. We also have a number of servers in place to prevent possible downtimes in the case when ALL the servers have issues at the same time.
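The idea of keeping requests away from a sick instance until it recovers can be sketched as a health-aware round-robin picker. This is only an illustration under stated assumptions: the `HealthAwareBalancer` class, its method names, and the server names are hypothetical and do not describe Kalix's real load balancer.

```python
class HealthAwareBalancer:
    """Hypothetical round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_unhealthy(self, server):
        """Stop routing requests to a sick server."""
        self.healthy.discard(server)

    def mark_healthy(self, server):
        """Resume routing once the server has recovered."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = HealthAwareBalancer(["a", "b", "c"])  # made-up server names
lb.mark_unhealthy("b")                     # "b" is the sick server
picked = [lb.pick() for _ in range(4)]     # traffic goes only to "a" and "c"
lb.mark_healthy("b")                       # "b" has recovered
recovered = [lb.pick() for _ in range(3)]  # "b" receives requests again
```

While a server is marked unhealthy it gets no traffic, which gives it room to recover. If every server is unhealthy at once, `pick` raises an error, which is the case the extra servers mentioned above are meant to guard against.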
monitoring Mar 27, 2021, 12:31 AM UTC

We had a 1-minute downtime this afternoon because multiple servers went down at approximately the same time. The greatly reduced overall downtime was due to the improvements we made over the week. However, it looks like we haven't fixed the underlying issue of why servers are restarting in the first place. Today's downtime did provide some clues in our logs, which indicate that an error outside the application is causing the server to reset. We have just pushed a fix for this issue and will monitor to see whether it has solved the server restart issue.
monitoring Mar 29, 2021, 03:58 PM UTC

This morning we had a small 503 issue that mentions an 'argo tunnel'. This issue DOES NOT seem to be related to the issues we've been having in Kalix over the last week; instead, it seems to be a problem at Cloudflare, an internet access company that we use. The issue seems to have been fixed already and to have affected only a small number of requests. Over the weekend we did not see any server restarts, so we are cautiously optimistic that we have fixed the underlying restart issue. We are monitoring today under a higher traffic load to see whether the servers handle it OK.
resolved Mar 31, 2021, 12:44 AM UTC

There were no downtimes or restarts of the servers today. We made some changes to keep everything static instead of trying to increase and decrease usage during the day, and that has fixed the last of the issues we were seeing. We are confident that Kalix's servers will be stable moving forward and, as such, will be closing this issue.