NUACOM incident

Investigating issues with Cloud PBX, Severity: major

NUACOM experienced a major incident on February 27, 2025 affecting Cloud PBX, lasting 4h 46m. The incident has been resolved; the full update timeline is below.

Started: Feb 27, 2025, 01:52 PM UTC
Resolved: Feb 27, 2025, 06:39 PM UTC
Duration: 4h 46m
Detected by Pingoru: Feb 27, 2025, 01:52 PM UTC

Affected components

Cloud PBX

Update timeline

investigating Feb 27, 2025, 01:52 PM UTC

We are investigating an incident affecting Cloud PBX. We will provide updates via email and Statuspage shortly.

resolved Feb 27, 2025, 06:39 PM UTC

All services are restored, and the NOC team is closely monitoring the system. The root cause has been identified, and a mitigation plan is being put in place along with the incident report. Please accept our sincere apologies for the issues caused to your business.

postmortem Mar 09, 2025, 10:02 PM UTC

## Incident Summary

* **Date and Time (Europe/Dublin):** 27th Feb 2025, 13:50 – 17:00
* **Duration:** 3 hours 10 minutes
* **Affected Services:** Calling system

## Incident Description

**What Happened**

At approximately 13:50, a critical hardware malfunction occurred on a data center server. The error message, “A PCI parity error was detected on a component”, pointed to a defective network card. As a result, the host machine became inoperable, and all virtual machines on this host went offline.

Complicating matters, this server was the designated failover server for the primary machine that had experienced a similar malfunction earlier in the day. Because the failover server was still recovering and performing a file system check, it was not fully prepared to handle the additional load. Consequently, our operations team had to manually redistribute services to other servers, resulting in extended downtime.

**Conclusion Summary**

This was the second hardware failure to occur on the same day, an exceptional occurrence given our strong track record of zero server faults over the past five years. Moving forward, we are focusing our efforts on faster automated recoveries and more robust load and stress testing to prevent similar incidents.

**Assurance to Our Customers**

We sincerely apologize for any inconvenience caused by this incident. We recognize the impact of this second outage and are committed to preventing similar incidents. Our team is actively refining both automated and manual failover processes to ensure quicker recovery times. Through enhanced monitoring, comprehensive testing, and ongoing infrastructure improvements, we will continue delivering reliable, high-quality service.

If you have any concerns or questions, please reach out to our support team at any time.
## Detection and Response

**Detection**

Our monitoring systems detected multiple connectivity alerts originating from the same physical host, immediately notifying our Network Operations Center (NOC).

**Response Actions Taken**

* **13:50** – The hardware fault occurred, and the NOC team was alerted.
* **13:55** – Automated failover processes reassigned most affected clients, mitigating the impact for a majority of customers.
* **14:00** – Diagnostics confirmed the malfunction was hardware-related.
* **14:10** – Manual redistribution of client accounts to alternative hosts began.
* **17:00** – Full redistribution completed, restoring all services.

## Impact Assessment

**User Impact**

* Approximately 7% of clients experienced a complete outage of PBX and calling services while the redistribution was underway.

While the automated failover partially succeeded, the event highlighted critical areas where additional failover capacity and faster manual procedures can further minimize future downtime.

## Lessons Learned

1. **Automation Enhancement** – We need improved tooling for situations where automated failover cannot manage the entire load, allowing for quicker manual redistribution.
2. **Improved Response Times** – Regular drills and simulations will enable our team to respond more efficiently to unexpected hardware faults.

## Preventative Measures

**Short-Term Actions**

* Verify that all services are fully operational following this incident.
* Maintain heightened system monitoring to detect potential hardware anomalies earlier.
* Schedule additional Disaster Recovery Testing to confirm failover resilience.

**Long-Term Actions**

* Conduct regular redundancy testing to ensure minimal downtime in the event of hardware failures.
* Refine failover testing procedures to maintain consistent high availability.
* Implement load and stress testing to prepare systems to handle recovery processes more seamlessly.