Sirv experienced a minor incident on October 23, 2023 affecting Sirv CDN (CDN request from Los Angeles) and Sirv CDN (CDN request from New York), lasting 22 days. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Oct 23, 2023, 01:10 PM UTC
A dramatic increase in requests to the Washington DC CDN location occurred at 14:10 UTC on 23 October, causing a small proportion of requests to be returned slowly (taking longer than 2 seconds). The issue subsided over the next 3 hours, though average response time remained elevated. The issue returned the following day at 14:32 UTC on 24 October, more severely than before, causing 17% of requests to load slowly (longer than 2 seconds, with some taking as long as 30 seconds). The issue impacted two of Sirv's 25 CDN locations.
- resolved Nov 14, 2023, 01:37 PM UTC
The detailed investigation into this issue found two contributing factors. The first was an update to AWS Route53, which appears to have been implemented by AWS on 17 October. This was not a change made by Sirv to its Route53 rules but a change within the Route53 service itself. The Los Angeles POP had slightly different routing logic from the other POPs, and the AWS update caused routing to behave differently, with traffic being routed to the Washington DC fallback. This went unnoticed because the higher load was still within Washington DC's capacity and requests were returned successfully. The second contributing factor was higher than normal traffic during peak US trading hours on 23 and 24 October. The two factors combined caused the Washington DC hard disks to return requests significantly slower than normal.
Since the issue occurred, we have taken multiple actions to prevent it from recurring, to mitigate similar possible incidents, and to accelerate resolution time in the event of an incident. Actions include: New routing logic has been applied. Washington DC capacity has been increased. Reserve capacity has been increased in all other CDN locations. A new tolerance has been introduced for removing a CDN location from routing. Additional disk monitoring has been added to help prepare capacity upgrades earlier. HDDs are being replaced with SSDs. The SOP for our support team has been updated for investigating and responding to similar issues.
Classified as severity level 3: Minor impact.
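As an illustration only, the idea of a tolerance for removing a CDN location from routing could be sketched as below. This is not Sirv's implementation, which the report does not describe. The 2-second definition of a slow request comes from the report; the 5% tolerance, the `LocationStats` type, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

SLOW_THRESHOLD_SECONDS = 2.0  # "slow" as defined in the report: longer than 2 seconds
REMOVAL_TOLERANCE = 0.05      # hypothetical value: tolerate up to 5% slow requests


@dataclass
class LocationStats:
    """Hypothetical container for one CDN location's recent response times."""
    name: str
    response_times: List[float]  # seconds, for a recent measurement window


def slow_fraction(response_times, threshold=SLOW_THRESHOLD_SECONDS):
    """Return the share of requests slower than the threshold (0.0 if no data)."""
    if not response_times:
        return 0.0
    slow = sum(1 for t in response_times if t > threshold)
    return slow / len(response_times)


def locations_in_rotation(stats, tolerance=REMOVAL_TOLERANCE):
    """Return names of locations whose slow-request share is within the tolerance."""
    healthy = [s.name for s in stats if slow_fraction(s.response_times) <= tolerance]
    # If every location breaches the tolerance, keep them all rather than
    # returning an empty routing set.
    return healthy or [s.name for s in stats]
```

With a window in which 17 of 100 requests exceed 2 seconds, as reported for 24 October, `slow_fraction` returns 0.17, and under the assumed 5% tolerance that location would be dropped from rotation while faster locations remain.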