Mammoth Cloud incident

Network outage affecting Xen VPS

Minor, Resolved

Mammoth Cloud experienced a minor incident on August 8, 2016, lasting 3h 17m. The incident has been resolved; the full update timeline is below.

Started
Aug 08, 2016, 07:11 PM UTC
Resolved
Aug 08, 2016, 10:28 PM UTC
Duration
3h 17m
Detected by Pingoru
Aug 08, 2016, 07:11 PM UTC

Update timeline

  1. investigating Aug 08, 2016, 07:11 PM UTC

    We are aware of a network issue affecting our Xen VPS customers. The issue primarily seems to be impacting network access from international locations.

  2. investigating Aug 08, 2016, 09:33 PM UTC

    The issue is now also affecting Australian traffic. We are investigating.

  3. identified Aug 08, 2016, 09:52 PM UTC

    TPG has confirmed a fault with an upstream router and is working to resolve the issue.

  4. monitoring Aug 08, 2016, 09:55 PM UTC

    Connectivity from both Australia and overseas has been restored.

  5. resolved Aug 08, 2016, 10:28 PM UTC

    This incident has been resolved.

  6. postmortem Aug 01, 2018, 07:37 PM UTC

    At approximately 5:00AM AEST, Mammoth started receiving a high rate of errors from our various monitoring probes for traffic destined to our Xen VPS servers (the "Xen network" for the rest of this postmortem). After investigating the fault, we determined that while the Xen network was accessible from all Australian locations we tested, international traffic was being dropped at TPG router 202.7.173.230. Here is an example traceroute:

    ```
    1 199.87.228.65 (199.87.228.65) 0.802 ms 1.353 ms 1.712 ms
    2 pdx-edge-rtr01.forked.net (199.87.231.25) 2.214 ms 2.571 ms 2.830 ms
    3 v323.core1.pdx1.he.net (216.218.244.225) 3.591 ms 4.012 ms 4.222 ms
    4 * * *
    5 10ge10-20.core1.sjc2.he.net (72.52.92.157) 18.788 ms 19.364 ms 19.588 ms
    6 tpg-internet-pty-ltd.10gigabitethernet12-1.core1.sjc2.he.net (64.62.194.114) 197.041 ms 197.205 ms 197.328 ms
    7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 222.355 ms 222.117 ms 222.190 ms
    8 202.7.173.230 (202.7.173.230) 194.314 ms 194.251 ms 194.317 ms
    9 * * *
    ```

    By comparison, a working traceroute previously ended like this:

    ```
    6 tpg-internet-pty-ltd.10gigabitethernet3-1.core1.sjc1.he.net (72.52.66.22) 148.588 ms 148.621 ms 148.616 ms
    7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 186.208 ms 186.483 ms 186.590 ms
    8 202.7.173.230 (202.7.173.230) 186.779 ms 187.769 ms 187.336 ms
    9 203.220.0.231.mammoth.net.au (203.220.0.231) 189.997 ms 190.253 ms 190.375 ms
    ```

    (where hop 9 is the Xen network router)

    A fault was raised with TPG via email at 5:20AM AEST. With no resolution, we escalated the issue by phone at 6:40AM, when TPG confirmed to Mammoth that it was a router fault and was being worked on. At approximately 7:20AM, TPG stopped announcing Mammoth IP space via BGP and the Xen network became inaccessible from all locations. At approximately 7:50AM, the BGP announcement resumed and service was restored for both Australian and international traffic.

    TPG has not confirmed the specifics of the root cause or resolution, but a traceroute now shows:

    ```
    6 tpg-internet-pty-ltd.10gigabitethernet3-1.core1.sjc1.he.net (72.52.66.22) 148.588 ms 148.621 ms 148.616 ms
    7 203-219-35-147.static.tpgi.com.au (203.219.35.147) 149.843 ms 148.347 ms 148.442 ms
    8 203.220.0.231.mammoth.net.au (203.220.0.231) 147.237 ms 147.209 ms 147.414 ms
    ```

    The trace has shortened by one hop, and we thus conclude that:

    * router 202.7.173.230 is no longer in use; and
    * the fault was resolved by connecting Mammoth directly to the upstream router on 203.219.35.0/24

    Thus, the total outage between 7:20AM and 7:50AM corresponds to TPG migrating the Xen network from router 202.7.173.230 to a direct connection with their upstream router.
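
The traceroute comparison in the postmortem can be reproduced mechanically. The following is a minimal Python sketch, not Mammoth's tooling: the helper names (`responding_hops`, `last_responding_hop`, `path_diff`) are hypothetical, and it assumes classic traceroute output of the form `hop host (ip) rtt ...`, with unanswered hops printed as `* * *`. The trace text is copied from the postmortem, trimmed to the final hops.

```python
import re

# Assumed line format: "<hop> <host> (<ipv4>) <rtt> ms ...".
# Lines such as "9 * * *" have no "(ip)" part and are skipped.
HOP_RE = re.compile(r"^\s*(\d+)\s+\S+\s+\((\d+\.\d+\.\d+\.\d+)\)")


def responding_hops(trace):
    """Return the IPs of the hops that answered, in path order."""
    hops = []
    for line in trace.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append(match.group(2))
    return hops


def last_responding_hop(trace):
    """Return the IP of the last hop that answered, or None."""
    hops = responding_hops(trace)
    return hops[-1] if hops else None


def path_diff(before, after):
    """Return (removed, added): IPs only in `before`, IPs only in `after`."""
    hops_before = responding_hops(before)
    hops_after = responding_hops(after)
    removed = [ip for ip in hops_before if ip not in hops_after]
    added = [ip for ip in hops_after if ip not in hops_before]
    return removed, added


FAILING_TRACE = """\
7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 222.355 ms 222.117 ms 222.190 ms
8 202.7.173.230 (202.7.173.230) 194.314 ms 194.251 ms 194.317 ms
9 * * *
"""

TRACE_BEFORE = """\
6 tpg-internet-pty-ltd.10gigabitethernet3-1.core1.sjc1.he.net (72.52.66.22) 148.588 ms 148.621 ms 148.616 ms
7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 186.208 ms 186.483 ms 186.590 ms
8 202.7.173.230 (202.7.173.230) 186.779 ms 187.769 ms 187.336 ms
9 203.220.0.231.mammoth.net.au (203.220.0.231) 189.997 ms 190.253 ms 190.375 ms
"""

TRACE_AFTER = """\
6 tpg-internet-pty-ltd.10gigabitethernet3-1.core1.sjc1.he.net (72.52.66.22) 148.588 ms 148.621 ms 148.616 ms
7 203-219-35-147.static.tpgi.com.au (203.219.35.147) 149.843 ms 148.347 ms 148.442 ms
8 203.220.0.231.mammoth.net.au (203.220.0.231) 147.237 ms 147.209 ms 147.414 ms
"""

# During the outage, the last hop to answer was the TPG router.
print(last_responding_hop(FAILING_TRACE))  # 202.7.173.230

# After the fix, 202.7.173.230 is gone and the TPG hop has changed address.
print(path_diff(TRACE_BEFORE, TRACE_AFTER))
# (['203.219.35.129', '202.7.173.230'], ['203.219.35.147'])
```

The diff shows the path lost 202.7.173.230 and that the preceding TPG hop moved from 203.219.35.129 to 203.219.35.147, both within 203.219.35.0/24. This is consistent with the conclusion above, but it only describes the observed path. It does not confirm what TPG changed.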