Viirtue incident

Core2-NJ Apollo

Critical Resolved

Viirtue experienced a critical incident on August 16, 2024 affecting NJ2 Core Server, lasting 21h 6m. The incident has been resolved; the full update timeline is below.

Started
Aug 16, 2024, 10:17 AM UTC
Resolved
Aug 17, 2024, 07:24 AM UTC
Duration
21h 6m
Detected by Pingoru
Aug 16, 2024, 10:17 AM UTC

Affected components

NJ2 Core Server

Update timeline

  1. identified Aug 16, 2024, 10:17 AM UTC

    The Core2-NJ Apollo server is currently experiencing network-related issues affecting several services on the server. Traffic has been moved to the Florida core while Viirtue engineers work on the issue. If you are pointing directly to the Core2-NJ IP, 64.21.2.1, please move your traffic to either of the FQDNs below. Updates to follow.

    core2-fl.5060.cloud
    core-lv.5060.cloud

  2. identified Aug 16, 2024, 01:12 PM UTC

    Viirtue engineers are pushing a potential fix within the next 15 minutes. Updates to follow.

  3. identified Aug 16, 2024, 01:35 PM UTC

    Web portal traffic has been moved to the Las Vegas core and is stabilizing. Updates to follow.

  4. identified Aug 16, 2024, 01:59 PM UTC

    Web portal traffic has stabilized. Engineers continue to work on the core2-nj server. Updates to follow.

  5. identified Aug 16, 2024, 02:42 PM UTC

    Viirtue engineers continue to work towards restoring the core2-nj server. Updates to follow.

  6. identified Aug 16, 2024, 03:33 PM UTC

    End clients could experience intermittent issues with Mobile Connect. All other services have stabilized. Updates to follow.

  7. identified Aug 16, 2024, 04:21 PM UTC

    We are aware of an issue affecting inbound calls and have implemented a fix. We are seeing traffic stabilize. Updates to follow.

  8. identified Aug 16, 2024, 05:31 PM UTC

    Inbound calling has stabilized. We will continue to work on restoring Core2-NJ. Updates to follow.

  9. identified Aug 16, 2024, 08:31 PM UTC

    Engineers are working on restoring core2-nj during off-peak hours throughout the weekend. Updates to follow.

  10. resolved Aug 17, 2024, 07:24 AM UTC

    Core2-NJ is operational. A full RFO will be shared with the community before Monday morning. Thank you for your patience.

  11. postmortem Aug 19, 2024, 06:09 PM UTC

    In the late hours of August 15th, 2024 (11:15 PM EDT), Viirtue engineers were performing scheduled maintenance on Core2-NJ in preparation for Viirtue’s upgrade to Netsapiens v44. Portal traffic was migrated to our Florida data center in advance to keep all systems operational. The engineers then performed an in-place upgrade of the core server’s underlying operating system from Ubuntu 18 to 20 (required for v44). At approximately 3 AM EDT on August 16th (during the maintenance window), the upgrade caused a “no boot” condition.

    The engineers worked to recover from the “no boot” condition and restore services as quickly as possible, given that the start of the business day was approaching. Because the maintenance window would need to extend well into the day, all traffic bound for Core2-NJ was routed to Florida (not just the portal). It was after this move that a misconfiguration in replication between data centers caused a portal slowdown on Core2-FL: the server had been endlessly looping through replication events, which compounded under load. At approximately 9:30 AM EDT, portal traffic was migrated to our Las Vegas data center, where we saw immediate relief and the portal became functional.

    During the day, the engineers worked diligently to restore Core2-NJ in the hope of bringing it back into service as quickly as possible, in order to restore SIP trunking service to partners who haven’t migrated to the new SIP trunking infrastructure. Two unsuccessful attempts were made to bring Core2-NJ back into service (11 AM EDT and 4 PM EDT). Each of these attempts caused phone registration issues, which lasted approximately 5 minutes each. With two strikes, we weren’t going to attempt a third until off-peak, late-night hours.

    The engineers resumed restoring service at 10:30 PM EDT, and Core2-NJ was fully operational by Saturday, August 17th, 2024, at 3:00 AM EDT. Portal traffic was moved back to Core2-NJ at 6:00 PM EDT that day.
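
The first update asks anyone pointing directly at the Core2-NJ IP (64.21.2.1) to move to one of the listed FQDNs. As a rough illustration only, the sketch below shows one way a partner might scan a list of configured hosts for that hard-coded IP. The function names and the idea of a flat list of host strings are assumptions made for this example; they do not come from Viirtue or Netsapiens tooling, and real device or trunk configurations will be stored differently.

```python
import ipaddress

# IP address named in the first timeline update.
CORE2_NJ_IP = ipaddress.ip_address("64.21.2.1")

# FQDNs suggested in the same update as replacements.
SUGGESTED_FQDNS = ("core2-fl.5060.cloud", "core-lv.5060.cloud")


def is_ip_literal(host: str) -> bool:
    """Return True if host is an IPv4/IPv6 literal rather than a hostname."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def targets_needing_change(hosts):
    """Return the configured hosts that point directly at the Core2-NJ IP.

    `hosts` is a hypothetical list of bare host strings (no port, no scheme).
    """
    return [
        host
        for host in hosts
        if is_ip_literal(host) and ipaddress.ip_address(host) == CORE2_NJ_IP
    ]


if __name__ == "__main__":
    configured = ["64.21.2.1", "core2-fl.5060.cloud", "10.0.0.1"]
    for host in targets_needing_change(configured):
        print(f"{host} is hard-coded; consider one of: {', '.join(SUGGESTED_FQDNS)}")
```

Hostnames are left alone because they follow whatever routing the provider publishes, while an IP literal stays pinned to a single server.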