Squiz incident

Network Issues in the AU DC

Major · Resolved

Squiz experienced a major incident on January 14, 2025 affecting Squiz Cloud Hosted Instances, lasting 1h 11m. The incident has been resolved; the full update timeline is below.

Started
Jan 14, 2025, 03:03 PM UTC
Resolved
Jan 14, 2025, 04:14 PM UTC
Duration
1h 11m
Detected by Pingoru
Jan 14, 2025, 03:03 PM UTC
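
The duration and the local-time equivalent of the start time can be checked with a few lines of Python. This is a minimal sketch: it treats AEST as a fixed UTC+10 offset (the label the postmortem uses), and the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: AEST is treated as a fixed UTC+10 offset, matching the
# label used in the postmortem text.
AEST = timezone(timedelta(hours=10), "AEST")

# Start and resolution times as listed above, in UTC.
started = datetime(2025, 1, 14, 15, 3, tzinfo=timezone.utc)
resolved = datetime(2025, 1, 14, 16, 14, tzinfo=timezone.utc)

# Incident duration, formatted the same way as the status page.
duration = resolved - started
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60
duration_text = f"{hours}h {minutes}m"
print(duration_text)

# The same start instant expressed in AEST (falls on Jan 15 local time).
started_aest = started.astimezone(AEST)
print(started_aest.isoformat())
```

Under that fixed-offset assumption, the 03:40 PM UTC monitoring update corresponds to 01:40 AEST on Jan 15, which matches the end of the degradation window quoted in the postmortem.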

Affected components

Squiz Cloud Hosted Instances

Update timeline

  1. investigating Jan 14, 2025, 03:03 PM UTC

    We are currently seeing intermittent network issues with one of our transit providers in AU. We’ll continue to investigate and provide updates as soon as possible.

  2. investigating Jan 14, 2025, 03:03 PM UTC

    We are continuing to investigate this issue.

  3. monitoring Jan 14, 2025, 03:40 PM UTC

    A fix has been implemented and services are restored. We are continuing to monitor the situation.

  4. resolved Jan 14, 2025, 04:14 PM UTC

    This incident has been resolved and all services are up and running.

  5. postmortem Jan 15, 2025, 02:44 PM UTC

    **Summary**

    Several AU customers experienced website degradation between Jan 14, 2025, 23:47 AEST and Jan 15, 2025, 01:40 AEST. Squiz identified operational issues with one of our third-party network providers. These issues affected Matrix and Funnelback services in AU, causing search disruptions, increased latency for several customers, and some 504 errors.

    On January 15 at 19:51 AEST, during a scheduled maintenance window, we observed a further degradation of service. This was a recurrence of the same issue.

    ### **Customer Impact**

    A small subset of AU customers experienced delayed search results when using Funnelback search and web functions, as well as further 504 errors.

    ### **Issue, Resolution and Mitigation**

    We experienced concurrent, intermittent traffic loss to and from NTT in both our Sydney and Melbourne DCs. The loss was severe enough to trigger automatic rerouting of ingress traffic to a different transit provider. Because the packet loss was intermittent, the rerouting cleared and then re-triggered several times.

    We intervened manually to force exclusive use of a different transit provider in Sydney. This partially mitigated the issue, but routing took some time to recover fully. The issue also affected some of our internal observability systems. Once the NTT transit was stable, we reverted our mitigations to restore fully redundant service.
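
The flapping described in the postmortem, where intermittent loss repeatedly triggers and then clears automatic rerouting, can be illustrated with a toy model. This is a sketch only and does not represent Squiz's actual routing logic: the 5% loss threshold, the sample values, and the function name `count_switches` are all hypothetical. It compares a naive threshold rule with one that requires several consecutive healthy samples before failing back.

```python
def count_switches(loss_samples, threshold=0.05, healthy_needed=1):
    """Count primary/backup transitions for a series of packet-loss samples.

    All values are hypothetical. `healthy_needed` is the number of
    consecutive samples below `threshold` required before traffic
    returns to the primary transit provider.
    """
    on_backup = False
    healthy_streak = 0
    switches = 0
    for loss in loss_samples:
        if loss >= threshold:
            healthy_streak = 0
            if not on_backup:
                on_backup = True   # fail over to the backup transit
                switches += 1
        else:
            healthy_streak += 1
            if on_backup and healthy_streak >= healthy_needed:
                on_backup = False  # fail back to the primary transit
                switches += 1
    return switches


# Intermittent loss: bad and good samples alternate before the link settles.
samples = [0.20, 0.00, 0.30, 0.01, 0.25, 0.00, 0.00, 0.00]

naive = count_switches(samples, healthy_needed=1)   # fails back immediately
damped = count_switches(samples, healthy_needed=3)  # waits for stability
print(naive, damped)  # 6 2
```

In this model, forcing exclusive use of one provider, as the manual intervention did in Sydney, corresponds to suppressing the automatic fail-back entirely until the primary link is known to be stable.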