Level365 experienced a notice incident on August 14, 2023 affecting Core UCaaS Services, lasting 22h 42m. The incident has been resolved; the full update timeline is below.
Affected components
- Core UCaaS Services
Update timeline
- investigating Aug 14, 2023, 02:45 PM UTC
We are currently investigating an issue with outbound calling and device connections.
- monitoring Aug 14, 2023, 02:56 PM UTC
The impacted services have been restarted and we are no longer seeing the issues. We will continue to monitor this and provide further updates.
- resolved Aug 15, 2023, 01:27 PM UTC
This issue has been resolved. A postmortem outlining the root cause will be prepared.
- postmortem Aug 30, 2023, 08:21 PM UTC
On the morning of August 14, 2023, our internal monitoring detected that resource utilization for a particular service on one node in our Midwest data center had increased from 6% to 80%. We immediately engaged our engineering department and began investigating the cause. During the investigation, a second increase raised utilization from 80% to 100%. At that point, the node became intermittently unavailable to clients, causing a partial service disruption. Restarting the offending service returned resource utilization to its previous 6% baseline and eliminated the disruptions. We apologize for this brief disruption in service performance. Further post-incident troubleshooting indicated that the service in question was affected by a memory leak in another service. This memory problem was triggered by an unrelated issue that has since been resolved. Engineering is working on a permanent fix for the memory leak in the offending service, and we will deploy it once it has been completed and tested.