Viirtue experienced a critical incident on August 16, 2024 affecting NJ2 Core Server, lasting 21h 6m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- identified Aug 16, 2024, 10:17 AM UTC
The Core2-NJ Apollo server is currently experiencing network-related issues affecting several services on the server. Traffic has been moved to the Florida core while Viirtue engineers work on the issue. If you are pointing directly to the Core2-NJ IP, 64.21.2.1, please move your traffic to either of the following FQDNs: core2-fl.5060.cloud or core-lv.5060.cloud. Updates to follow.
- identified Aug 16, 2024, 01:12 PM UTC
Viirtue engineers are pushing a potential fix within the next 15 minutes. Updates to follow.
- identified Aug 16, 2024, 01:35 PM UTC
Web portal traffic has been moved to the Las Vegas core and is stabilizing. Updates to follow.
- identified Aug 16, 2024, 01:59 PM UTC
Web portal traffic has stabilized. Engineers continue to work on the core2-nj server. Updates to follow.
- identified Aug 16, 2024, 02:42 PM UTC
Viirtue engineers continue to work towards restoring the core2-nj server. Updates to follow.
- identified Aug 16, 2024, 03:33 PM UTC
End clients could experience intermittent issues with Mobile Connect. All other services have stabilized. Updates to follow.
- identified Aug 16, 2024, 04:21 PM UTC
We are aware of an issue affecting inbound calls and have implemented a fix. We are seeing traffic stabilize. Updates to follow.
- identified Aug 16, 2024, 05:31 PM UTC
Inbound calling has stabilized. We will continue to work on restoring Core2-NJ. Updates to follow.
- identified Aug 16, 2024, 08:31 PM UTC
Engineers are working on restoring core2-nj during off-peak hours throughout the weekend. Updates to follow.
- resolved Aug 17, 2024, 07:24 AM UTC
Core2-NJ is operational. A full RFO will be shared with the community before Monday morning. Thank you for your patience.
- postmortem Aug 19, 2024, 06:09 PM UTC
In the late hours of August 15th, 2024 (11:15 PM EDT), Viirtue engineers were performing scheduled maintenance on Core2-NJ in preparation for Viirtue’s upgrade to Netsapiens v44. Portal traffic was migrated to our Florida data center in advance to keep all systems operational. The engineers then performed an in-place upgrade of the core server’s underlying operating system from Ubuntu 18 to 20 (required for v44). At approximately 3 AM EDT on August 16th (during the maintenance window), the upgrade caused a “no boot” condition. The engineers worked to recover from the “no boot” condition and restore services as quickly as possible, given that the start of the business day was approaching.

Knowing that the maintenance window would need to extend well into the day, the engineers routed all traffic bound for Core2-NJ to Florida (not just the portal). It was after this move that a misconfiguration in replication between data centers caused a portal slowdown on Core2-FL. The server had been endlessly looping through replication events, which compounded under load. At approximately 9:30 AM EDT, portal traffic was migrated to our Las Vegas data center, where we saw immediate relief and the portal became functional.

During the day, the engineers worked diligently to restore Core2-NJ, hoping to bring it back into service as quickly as possible to restore SIP trunking service to partners who haven’t migrated to the new SIP trunking infrastructure. Two unsuccessful attempts were made to bring Core2-NJ back into service (11 AM EDT and 4 PM EDT). Each of these attempts caused phone registration issues, which lasted approximately 5 minutes each. With two strikes, we weren’t going to attempt a third until off-peak, late-night hours.

The engineers resumed restoring service at 10:30 PM EDT, and Core2-NJ was fully operational by Saturday, August 17th, 2024, at 3:00 AM EDT. Portal traffic was moved back to Core2-NJ at 6:00 PM EDT that day.
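The postmortem quotes times in EDT while the update timeline is stamped in UTC. The following is a minimal Python sketch for lining the two up, assuming a fixed UTC-4 offset for EDT (valid for these August 2024 dates). The helper name `edt_to_utc` and the event labels are illustrative only and not part of any Viirtue tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EDT is treated as a fixed UTC-4 offset, which holds for August 2024.
EDT = timezone(timedelta(hours=-4), "EDT")


def edt_to_utc(year, month, day, hour, minute=0):
    """Convert an EDT wall-clock time to a timezone-aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=EDT)
    return local.astimezone(timezone.utc)


# Times quoted in the postmortem (EDT); the labels are hypothetical shorthand.
postmortem_events = {
    "maintenance start": edt_to_utc(2024, 8, 15, 23, 15),
    "portal moved to Las Vegas": edt_to_utc(2024, 8, 16, 9, 30),
    "Core2-NJ fully operational": edt_to_utc(2024, 8, 17, 3, 0),
}

for label, when in postmortem_events.items():
    print(f"{label}: {when:%b %d, %Y, %I:%M %p} UTC")
```

For example, the 9:30 AM EDT portal migration converts to 01:30 PM UTC, which lines up with the 01:35 PM UTC timeline update, and the 3:00 AM EDT restoration converts to 07:00 AM UTC, shortly before the 07:24 AM UTC resolved update.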