Provation Software incident

Provation iPro is currently experiencing technical difficulties

Provation Software experienced a critical incident on April 13, 2023 affecting iPro-anesthesia-instance-3684, lasting 44m. The incident has been resolved; the full update timeline is below.

Started: Apr 13, 2023, 07:35 PM UTC
Resolved: Apr 13, 2023, 08:20 PM UTC
Duration: 44m
Detected by Pingoru: Apr 13, 2023, 07:35 PM UTC

Affected components

iPro-anesthesia-instance-3684

Update timeline

investigating Apr 26, 2023, 09:44 PM UTC

We are actively investigating an incident with iPro.

resolved Apr 26, 2023, 09:45 PM UTC

Provation iPro has fully recovered. We apologize for the inconvenience.

postmortem Apr 26, 2023, 09:49 PM UTC

Storage Outage – 4/13/2023

On Thursday, April 13th, 2023, at 2:35 PM Central Time, LightEdge Engineers began receiving alerts suggesting that one member of a pair of redundant core network switches had initiated a self-reboot. LightEdge Support and Engineering teams immediately created an incident bridge to evaluate internal and customer impact.

While investigating, LightEdge Engineers identified that internet access for customers on the affected switch had failed over to the secondary switch successfully, but the storage network uplinks were showing as down. With the storage network uplinks in a down state, customer environments consuming shared storage resources became unavailable.

Network Engineers attempted a series of commands against the switch pair to reset the downed uplinks, and at 3:20 PM Central Time those attempts succeeded, restoring connectivity to the storage network. Once connectivity to the storage network was restored, LightEdge Engineers began rolling restarts of LightEdge services. At 6:00 PM Central Time, all Dedicated and Virtual Private Cloud Environments had recovered. In some instances, virtual machines required a reboot to restore complete environment functionality. To ensure long-term stability, LightEdge Engineers started host reboots across all affected environments.

The investigation conducted after service restoration determined that a firmware bug caused the initial switch reboot. There is an ongoing investigation into why the storage network did not fail over properly. LightEdge takes the redundancy of all network infrastructure seriously, and appropriate measures are being taken to improve network resiliency.
Incident Action Items

Activity | Owner | Status | Completed on | Due
Restore SAN connections to all devices | LightEdge | Complete | 4/13/23 |
Virtual Private Cloud host reboots | LightEdge | Complete | 4/14/23 |
Dedicated Private Cloud host reboots | LightEdge | In Process | | 4/19/23
Virtual Machine reboots | Customer | Advised | | TBD
Deploy new switch fabric | LightEdge | In Process | | 4/21/23
Migrate storage nodes to new switch fabric | | In Process | | 5/5/23

Please let us know if there are additional questions or concerns regarding this incident.

Thank you,
Brian Gibson
Director, Customer Care