YourCloudTelco incident

Call Quality issues

Minor Resolved

YourCloudTelco experienced a minor incident on March 1, 2021 affecting YourCloudTelco Calling Platform (Network), lasting 3d 5h. The incident has been resolved; the full update timeline is below.

Started
Mar 01, 2021, 12:33 AM UTC
Resolved
Mar 04, 2021, 06:02 AM UTC
Duration
3d 5h
Detected by Pingoru
Mar 01, 2021, 12:33 AM UTC

Affected components

YourCloudTelco Calling Platform (Network)

Update timeline

  1. investigating Mar 01, 2021, 12:33 AM UTC

    We are investigating call quality issues.

  2. investigating Mar 01, 2021, 05:53 AM UTC

    Today we received consistent call quality complaints. This is not related to last Friday’s fibre issue between our service and Mega-IX, a fault which Equinix remedied on Saturday 28 Feb.

    In the examples received, the first hops between Vocus and Mega-IX are all normal; packet loss consistently appears at Hop 3 and beyond:

    Hop 2 - Mega-IX router
    --- 103.26.68.194 ping statistics ---
    100 packets transmitted, 100 packets received, 0% packet loss
    round-trip min/avg/max/stddev = 0.492/1.665/14.604/2.652 ms

    Hop 3 - router / device / host beyond Hop 2
    --- 103.85.38.5 ping statistics ---
    100 packets transmitted, 96 packets received, 4% packet loss
    round-trip min/avg/max/stddev = 0.537/4.378/122.301/19.753 ms

    We are not seeing reports of equivalent quality issues from our AAPT/TPG customers (via our fibre transit) or our Telstra customers connecting via Mega-IX. We have identified some common patterns emerging among services supplied by third-party ISPs, which we hope to prove in the morning.

  3. investigating Mar 02, 2021, 04:49 AM UTC

    We believe we have identified a consistent pattern of packet loss coming into our network. Tomorrow morning between 5:00 and 6:00 am we will make a change to our external monitoring in support of our assumption. While we don't anticipate any disruption to calling, as a precaution we will make the change out of hours.

  4. monitoring Mar 03, 2021, 12:04 AM UTC

    The change to the network has been made and we are no longer seeing the packet loss; call quality has improved. We will provide details of the change shortly.

  5. resolved Mar 04, 2021, 06:02 AM UTC

    This incident has been resolved.

  6. postmortem Mar 04, 2021, 06:02 AM UTC

    Last year we introduced several discrete Elastic monitoring services, including Elastic Heartbeat for uptime and response time, and Packetbeat for traffic flow and latency overhead. Combined, the two services have been invaluable in helping our DevOps team identify problems and reduce time to resolution.

    On Friday 26 Feb we increased the frequency of both monitoring tools to help identify the fibre disruption which appeared at 11:30 am that morning. For reasons still unknown, the Heartbeat monitor created a recurring 10s latency spike, causing momentary packet loss on one of the voice servers. Even after we had discovered and temporarily disabled the monitor, the packet loss continued, which made no sense as Heartbeat has no footprint; it just reads.

    At 6:00 am on 3 March we reset both the impacted voice machine and the monitoring service. This cleared the recurring packet loss and, as we predicted, removed the corresponding voice issue. To further verify, we reimplemented Heartbeat against the same machine with no repeat of the previous bad behaviour, consistent with our remaining voice machines. Our conclusion is that one of the system calls or plugins used to deliver the stats was the probable cause, within the kernel. We have now been monitoring the service for approximately 35 hours with no repeated incidents.

    Through our continual testing, we have identified a group of Vocus NBN resellers with recurring packet loss upstream of our main Vocus transit link, albeit with very low latency. We are monitoring customers connecting through these ISPs as they may continue to experience poorer call quality. This is not related to the above incident, and we have raised it with Vocus for their monitoring.
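
The hop-by-hop ping summaries quoted in the 05:53 AM update can be compared programmatically rather than by eye. The following is a minimal Python sketch, not the vendor's tooling: it assumes BSD-style `ping` summary text (the format quoted in the update), and the names `PingSummary`, `parse_summaries`, `first_lossy_hop` and the 1% threshold are hypothetical choices made for illustration. The sample text is the pair of summaries from the update.

```python
import re
from dataclasses import dataclass


@dataclass
class PingSummary:
    host: str
    sent: int
    received: int
    rtt_min: float
    rtt_avg: float
    rtt_max: float
    rtt_stddev: float

    @property
    def loss_percent(self) -> float:
        # Recompute loss from the packet counts instead of trusting the printed figure.
        if self.sent == 0:
            return 0.0
        return 100.0 * (self.sent - self.received) / self.sent


# Matches one BSD-style ping summary block (assumed format, as quoted in the update).
SUMMARY_RE = re.compile(
    r"--- (?P<host>\S+) ping statistics ---\s+"
    r"(?P<sent>\d+) packets transmitted, (?P<recv>\d+) packets received, "
    r"[\d.]+% packet loss\s+"
    r"round-trip min/avg/max/stddev = "
    r"(?P<min>[\d.]+)/(?P<avg>[\d.]+)/(?P<max>[\d.]+)/(?P<sd>[\d.]+) ms"
)


def parse_summaries(text: str) -> list:
    """Return one PingSummary per summary block found, in order of appearance."""
    return [
        PingSummary(
            host=m["host"],
            sent=int(m["sent"]),
            received=int(m["recv"]),
            rtt_min=float(m["min"]),
            rtt_avg=float(m["avg"]),
            rtt_max=float(m["max"]),
            rtt_stddev=float(m["sd"]),
        )
        for m in SUMMARY_RE.finditer(text)
    ]


def first_lossy_hop(summaries, threshold_percent: float = 1.0):
    """Return the list index of the first summary at or above the loss threshold.

    The index is the position in the input list, not the traceroute hop number.
    Returns None when no summary reaches the threshold.
    """
    for index, summary in enumerate(summaries):
        if summary.loss_percent >= threshold_percent:
            return index
    return None


# The two summaries quoted in the update (Hop 2, then Hop 3).
REPORT = """\
--- 103.26.68.194 ping statistics ---
100 packets transmitted, 100 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.492/1.665/14.604/2.652 ms
--- 103.85.38.5 ping statistics ---
100 packets transmitted, 96 packets received, 4% packet loss
round-trip min/avg/max/stddev = 0.537/4.378/122.301/19.753 ms
"""

summaries = parse_summaries(REPORT)
for s in summaries:
    print(f"{s.host}: {s.loss_percent:.1f}% loss, max RTT {s.rtt_max} ms")
```

Recomputing the loss from the transmitted and received counts, rather than reading the printed percentage, keeps the comparison consistent if summaries from different `ping` builds round differently. Output from other `ping` implementations (for example Linux iputils) uses different wording and would need its own pattern.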