Alkira incident
Routing and Health Monitoring issues in US-EAST CXP
Alkira experienced a major incident on November 28, 2022, lasting 4h 22m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Nov 28, 2022, 08:59 PM UTC
We are currently investigating an issue with routing and health monitoring services in US-EAST CXP.
- identified Nov 28, 2022, 09:05 PM UTC
We have identified the issue and are actively working towards fixing it.
- monitoring Nov 28, 2022, 09:09 PM UTC
We have resolved the issue for now and are actively monitoring it. We will post an update with more details soon.
- resolved Nov 29, 2022, 01:22 AM UTC
As of 21:10 UTC, the issue was fully resolved. We have identified the issue and will post a detailed report here.
- postmortem Nov 29, 2022, 02:53 AM UTC
Between 19:53 and 21:10 UTC on November 28, one of the internal databases in the US-EAST CXP was overloaded, and some services in that region were unable to connect to it or to read from and write to it. The other services in that region were resilient to this issue; however, the Routing service and the Health monitoring service were impacted. The Routing service impact was specific to IPSec connectors with IKE_STATUS-based health reporting. Those connectors might have seen routes withdrawn and re-advertised a few times during this window, which could have caused a loss of connectivity to the CXP over those IPSec connectors. The Health monitoring service impact was limited to reporting: health for the connectors might have been shown as Down on the Network page. This was only a reporting issue in the UI, and there was no issue with any of the tunnels connecting to the cloud connectors. All other services operated normally during this time in US-EAST and in all other CXP regions.