Kalix EMR experienced a major incident on April 1, 2021 affecting Kalix Platform and Telehealth, lasting 2h 44m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Apr 01, 2021, 10:03 PM UTC
Some customers are reporting an unknown error. Our monitoring shows no errors and our servers are operational. This appears to be a DNS problem, and we are currently investigating.
- identified Apr 01, 2021, 10:16 PM UTC
This issue has been identified. Our authentication server is still on our old system and uses Azure DNS (DNS is the service that maps a website name to a machine address), which is currently experiencing issues. Azure's status can be tracked at https://status.azure.com/en-us/status. The Azure issue appears to be ongoing; it does not seem to be affecting Kalix right now, but Azure describes the problems as intermittent. Cloudflare, which hosts our main sites, appears to have cached the previous DNS values, so results may be patchy depending on whether a previous record was cached. We saw a large number of DNS issues over the last 30 minutes, amounting to roughly 15 minutes of disruption. This issue is not related to the Kalix problems from last week. The servers themselves are working, are not restarting, and are not the cause of these issues.
- monitoring Apr 01, 2021, 10:30 PM UTC
We have bypassed Azure DNS by adding a manual DNS record in Cloudflare that points directly to the servers. This should prevent any further problems related to Azure DNS. We are monitoring, and if any other records are affected we will fix those too.
- resolved Apr 02, 2021, 12:47 AM UTC
There have been no further issues since we switched our DNS record, and Azure's DNS service has also been fixed. We will keep the bypass in place so that Kalix no longer relies on Azure DNS. We are closing this issue now, as the fix appears to be working correctly.
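For readers who want to check whether a hostname currently resolves from their own network, the DNS lookup described in the timeline can be sketched roughly as below. This is a minimal illustration, not Kalix's own tooling: the hostname `auth.example.com` is a placeholder, and the function names `resolve_ipv4` and `check_hosts` are hypothetical. It uses only Python's standard `socket` module.

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve_ipv4(hostname: str, resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted, unique IPv4 addresses for hostname.

    Returns an empty list when the DNS lookup fails, which is the
    symptom customers would have seen during the incident.
    """
    try:
        infos = resolver(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # The name could not be resolved (DNS failure or unknown host).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


def check_hosts(
    hostnames: Iterable[str], resolver: Callable = socket.getaddrinfo
) -> Dict[str, List[str]]:
    """Map each hostname to the addresses it resolves to (empty if it fails)."""
    return {name: resolve_ipv4(name, resolver) for name in hostnames}


if __name__ == "__main__":
    # Placeholder hostname (an assumption); substitute the name you want to check.
    for name, addresses in check_hosts(["auth.example.com"]).items():
        status = ", ".join(addresses) if addresses else "DNS lookup failed"
        print(f"{name}: {status}")
```

An empty result for a name that normally resolves suggests a DNS problem rather than a server outage. Because of caching, as noted in the timeline, results can differ from one network to another.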