Alkira incident

Health Reporting Service Down

Minor Resolved

Alkira experienced a minor incident on May 31, 2023 affecting six Azure components, including CACENTRAL-AZURE-1 (Toronto) and CAEAST-AZURE-1 (Quebec City), lasting 21m. The incident has been resolved; the full update timeline is below.

Started
May 31, 2023, 08:50 PM UTC
Resolved
May 31, 2023, 09:12 PM UTC
Duration
21m
Detected by Pingoru
May 31, 2023, 08:50 PM UTC

Affected components

CACENTRAL-AZURE-1 (Toronto)
CAEAST-AZURE-1 (Quebec City)
USCENTRAL-AZURE-1 (Texas)
USCENTRAL-AZURE-3 (Iowa)
USEAST-AZURE-1 (Virginia)
USEAST-AZURE-2 (Virginia)

Update timeline

  1. investigating May 31, 2023, 08:50 PM UTC

    We are investigating an issue in Azure regions where the tunnel health reporting service is failing.

  2. investigating May 31, 2023, 08:53 PM UTC

    We see that provisioning services in these regions are impacted as well.

  3. investigating May 31, 2023, 08:55 PM UTC

    We are continuing to investigate this issue.

  4. investigating May 31, 2023, 09:02 PM UTC

    We are actively working on recovering the services, and we expect to see recovery soon.

  5. investigating May 31, 2023, 09:08 PM UTC

    All services should now be recovered. Connector health should be restored to the correct state on the topology.

  6. investigating May 31, 2023, 09:11 PM UTC

    We are continuing to investigate this issue.

  7. resolved May 31, 2023, 09:12 PM UTC

    We have resolved the issue and are actively monitoring the services. We will post an RCA for this issue soon.

  8. postmortem May 31, 2023, 09:13 PM UTC

    At approximately 20:40 UTC on May 31st, we noticed an increase in workload on one of our infrastructure clusters, which serves the USCENTRAL-AZURE-3, USEAST-AZURE-2, CACENTRAL-AZURE-1, CAEAST-AZURE-1, and USEAST-AZURE-1 CXP regions. Health reporting and provisioning services were impacted by this increased workload. We quickly added more nodes to the infrastructure cluster to remediate the issue, and the failing services were recovered at 21:10 UTC. We do not expect this to recur and are actively reviewing all other regions for similar workload spikes. Please reach out to Alkira Support if you have any questions.