NUACOM incident

Investigating issues with Cloud PBX, Severity: major

NUACOM experienced a major incident on February 27, 2025, lasting 2h 20m. The incident has been resolved; the full update timeline is below.

Started: Feb 27, 2025, 08:16 AM UTC
Resolved: Feb 27, 2025, 10:37 AM UTC
Duration: 2h 20m
Detected by Pingoru: Feb 27, 2025, 08:16 AM UTC

Update timeline

investigating Feb 27, 2025, 08:16 AM UTC

We are investigating an incident affecting Cloud PBX. We will provide updates via email and Statuspage shortly.

identified Feb 27, 2025, 08:49 AM UTC

The issue has been identified and a fix is being implemented.

resolved Feb 27, 2025, 10:37 AM UTC

This incident has been resolved.

postmortem Mar 09, 2025, 09:49 PM UTC

## Incident Summary

* **Date and Time (Europe/Dublin):** 27th Feb 2025, 00:50 – 09:06
* **Affected Services:** Voice Services

## Incident Description

**What Happened**

At approximately 00:50, a critical hardware malfunction occurred on a physical server in our data center. An error message, “A bus fatal error was detected on a component at slot 1”, indicated a failure in one of the PCI network interfaces. This caused the primary host to become inoperable, leading all virtual machines on that host to go offline. However, our backup server automatically took over services, ensuring that the majority of customers remained operational.

**Conclusion**

A hardware failure led to a partial outage, but our monitoring systems and backup infrastructure responded quickly to minimize disruption. Our team swiftly identified the root cause and resolved the incident. As part of our continuous improvement efforts, we are implementing additional steps to improve both our response times and our automated failover processes, ensuring that any such event is contained and resolved even more promptly in the future.

**Assurance to Our Customers**

We sincerely apologize for any inconvenience caused by this incident. Nuacom takes every disruption seriously, and we remain committed to providing a reliable, high-quality service at all times. Our team’s rapid response and the successful automatic failover process underscore our dedication to proactively managing issues. We will continue to strengthen our systems and procedures to prevent similar incidents from occurring in the future.

If you have any questions or need further information, please reach out to our support team at any time. Your trust in Nuacom is greatly appreciated, and we are steadfast in our mission to keep your communications running seamlessly.
## Detection and Response

**Detection**

Our robust monitoring tools detected multiple connectivity alerts related to a single physical host, triggering a high-priority alert for our Network Operations Center (NOC). A subsequent review of logs and diagnostic data confirmed that the NIC on PCI Slot 1 was the failure point.

**Response Actions Taken**

* **00:55** – Automated failover kicked in, restoring service for most affected clients.
* **06:50** – NOC team began a detailed investigation.
* **07:30** – Root cause pinpointed and verified as hardware failure.
* **08:10** – Primary host restarted to stabilize operations.
* **08:16** – Remaining services were manually transferred to the backup host, ensuring minimal further interruption.
* **09:06** – Service containers on the primary host came back online once the hardware issue was resolved.

## Impact Assessment

**User Impact**

* **Partial Outage** – Approximately 7.2% of our customer accounts experienced a temporary loss of calling functionality during low-traffic hours and partial service degradation during early business hours. Some clients were unable to place or receive calls until the failover and manual transfers were completed.

Despite these challenges, our failover safeguards ensured that most customers experienced minimal disruption and the incident was contained and addressed as quickly as possible.

## Lessons Learned

1. **Enhanced Disaster Recovery Drills** – We will conduct more frequent and comprehensive failover tests to further reduce potential downtime.

## Preventative Measures

**Short-Term Actions**

* Confirm the full operational status of all services immediately following the incident.
* Heighten our monitoring protocols to proactively detect any new hardware anomalies.

**Long-Term Actions**

* Schedule additional, more frequent disaster recovery exercises.
* Expand our redundancy testing suite with new scenarios drawn from this incident to better anticipate future failures.
* Add night shift NOC coverage to respond to incidents 24/7.
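
For readers who want to check the intervals in the response timeline, the following is a minimal sketch that computes the minutes between consecutive entries and since the initial failure. The times are copied from this postmortem (Europe/Dublin, all on 27th Feb 2025, with 00:50 given as approximate). The event labels are paraphrased, and the names `TIMELINE`, `minutes_between`, and `elapsed_report` are hypothetical and not part of any Nuacom tooling.

```python
from datetime import datetime

# Times taken from the postmortem (Europe/Dublin, 27 Feb 2025).
# Labels are paraphrased for brevity; 00:50 is stated as approximate.
TIMELINE = [
    ("00:50", "hardware failure on primary host"),
    ("00:55", "automated failover"),
    ("06:50", "NOC investigation began"),
    ("07:30", "root cause verified"),
    ("08:10", "primary host restarted"),
    ("08:16", "manual transfer to backup host"),
    ("09:06", "service containers back online"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end; both are HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_report(timeline):
    """Return (time, label, minutes since previous entry, minutes since first entry)."""
    first = timeline[0][0]
    previous = first
    rows = []
    for time, label in timeline:
        rows.append((time, label,
                     minutes_between(previous, time),
                     minutes_between(first, time)))
        previous = time
    return rows


if __name__ == "__main__":
    for time, label, step, total in elapsed_report(TIMELINE):
        print(f"{time}  +{step:>3} min  (T+{total:>3} min)  {label}")
```

On these figures the full window from 00:50 to 09:06 is 496 minutes, and the longest single gap is the 355 minutes between the automated failover and the start of the NOC investigation.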