Alkira incident

Portal is inaccessible

Alkira experienced a critical incident on April 2, 2025 affecting the Management Portal, lasting 1h 41m. The incident has been resolved; the full update timeline is below.

Started: Apr 02, 2025, 06:15 PM UTC
Resolved: Apr 02, 2025, 07:56 PM UTC
Duration: 1h 41m
Detected by Pingoru: Apr 02, 2025, 06:15 PM UTC

Affected components

Management Portal

Update timeline

investigating Apr 02, 2025, 06:15 PM UTC

We are currently investigating an issue with the portal being inaccessible.
investigating Apr 02, 2025, 06:16 PM UTC

We are continuing to investigate this issue.
investigating Apr 02, 2025, 06:22 PM UTC

As of April 2nd at 6:00 PM UTC, internal alerts were triggered, indicating an issue with the management portal service. Our team is actively investigating the incident.
investigating Apr 02, 2025, 06:43 PM UTC

The management portal service has not yet recovered; we are still investigating the issue.
identified Apr 02, 2025, 07:08 PM UTC

We have identified the issue and are trying various ways to remediate it.
identified Apr 02, 2025, 07:13 PM UTC

We have identified and fixed the issue. We are seeing signs of recovery, and the portal should be accessible. We will continue to remediate this issue and post a root cause analysis.
monitoring Apr 02, 2025, 07:14 PM UTC

A fix has been implemented and we are monitoring the results.
resolved Apr 02, 2025, 07:56 PM UTC

This incident is fully resolved. A detailed postmortem will follow.
postmortem Apr 08, 2025, 02:11 AM UTC

### Summary of the Problem

On April 2nd, 2025, at approximately 17:50 UTC, one of the core infrastructure services supporting the Alkira Management Portal experienced significant memory pressure. This led to temporary portal unavailability, impacting customers’ ability to access and manage their network configuration. The issue was entirely resolved by 19:10 UTC the same day.

### Findings

The incident was triggered by a burst of complex API requests that caused elevated memory usage within one of the services. The root cause was identified promptly, and mitigation steps were initiated quickly. However, full recovery took additional time because sustained service health had to be confirmed before access was restored to all users. While the platform is designed to support a wide range of workloads, this incident highlighted opportunities to speed up recovery workflows in similar scenarios. Customer access to the portal was limited during this time, but data traffic was not affected.

### Corrective Actions

* Immediately applied targeted API throttling and usage restrictions for the APIs that caused the memory spike.
* Restarted and stabilized the affected service once memory levels normalized.
* Closely monitored system performance to ensure full recovery and continued stability.

### Next Steps

To further enhance platform resilience:

* Apply API rate limiting and throttling mechanisms across the platform.
* Optimize existing APIs to be more lightweight and resource-efficient.
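
For readers unfamiliar with the rate limiting mentioned under Next Steps, the following is a minimal, generic sketch of one common technique, a token bucket. It is not Alkira's implementation; the postmortem does not describe how throttling was or will be built. The class name `TokenBucket`, its parameters (`rate`, `capacity`, `cost`), and the injectable `clock` are all hypothetical names chosen for illustration.

```python
import time


class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical, not Alkira's code).

    Tokens refill at `rate` per second up to `capacity`. Each request
    spends `cost` tokens; a request is rejected when too few remain.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so tests can control time
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

In a request handler, a rejected call would typically be answered with HTTP 429 (Too Many Requests) rather than being processed. One possible design choice, assumed here rather than taken from the postmortem, is to charge a higher `cost` for requests known to be expensive, so that a burst of complex calls drains the bucket faster than the same number of simple ones.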