Scale SERP experienced a minor incident on July 15, 2020 affecting API, lasting 3h 35m. The incident has been resolved; the full update timeline is below.
Affected components: API
Update timeline
- identified Jul 15, 2020, 06:30 AM UTC
We are working with our hosting partners to mitigate the effects of a DDoS attack on our infrastructure that started at 05:30 UTC. Service is now being restored. API endpoints are currently responding with increased latency.
- monitoring Jul 15, 2020, 07:00 AM UTC
DDoS mitigation is ongoing and we continue to monitor the situation. API requests and Batches will be processed at a slower rate.
- monitoring Jul 15, 2020, 08:36 AM UTC
The platform is processing the Batch backlog generated during the incident (Batches are automatically paused during an incident and resume afterwards). We anticipate the backlog will be processed by 09:00 UTC. The real-time endpoints are functioning, but with increased latency. We will publish a post-mortem of the incident later today.
- monitoring Jul 15, 2020, 09:21 AM UTC
The Batch backlog has now been processed. Real-time endpoints should return to normal latency within the next 45 minutes. For customers using the skip_on_incident request parameter, requests will continue to fail gracefully until the incident is closed.
- monitoring Jul 15, 2020, 09:38 AM UTC
Having mitigated the effects of the earlier DDoS attack, we continue to monitor for further issues. We will post a root-cause analysis after the incident has closed.
- resolved Jul 15, 2020, 10:06 AM UTC
This incident has been resolved.
- postmortem Jul 15, 2020, 10:06 AM UTC
We apologise for the interruption to service caused by the DDoS attack earlier today.

At approximately 05:30 UTC we were alerted via our DNS provider (Cloudflare) to a DDoS attack targeting our primary internet-facing API endpoints. The attack peaked at over 65GB/s. We employ anti-DDoS measures on all internet-facing endpoints, and these measures have defeated previous DDoS attacks. However, today's incident involved a customer API key that had been compromised (i.e. was in the possession of the attacker). The customer has since confirmed they were the subject of a hacking incident several days ago, and this is likely how the attacker obtained their API key.

Our anti-DDoS measures were less restrictive when faced with an authenticated request (i.e. a request carrying a valid customer API key). Due to the volume of traffic received, the services responsible for authenticating customer API keys and decrementing credits could not scale quickly enough and became overwhelmed. This resulted in slow responses from the API and timeouts.

At approximately 07:00 UTC the affected API key was isolated and blocks were put in place. A significant backlog of Batches had built up during the attack. The platform resumed processing these, and we provisioned more resources to accommodate the load. During this time the real-time endpoints were served via a failover system and continued to experience longer response times. By approximately 09:30 UTC the backlog had been processed, and migration of the real-time endpoints from the failover system back to the live system was initiated.

To mitigate against a future recurrence we are implementing the following:

1. An option on the Dashboard to allow you to regenerate your API key in the event that it has been compromised (in the meantime this can be requested via our support channels).
2. More robust anti-DDoS protections for authenticated requests and significantly more aggressive rate-limiting for suspected API key abuse.

Once again, we sincerely apologise for the disruption during this incident.
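
For readers wondering what rate-limiting of authenticated requests can look like, the following is a minimal illustrative sketch of a per-API-key token bucket. It is not a description of our actual implementation; the class names, the capacity and the refill rate are all hypothetical and chosen only for the example.

```python
import time


class TokenBucket:
    """A single bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity, refill_rate, now):
        self.capacity = capacity        # maximum burst size (hypothetical value)
        self.refill_rate = refill_rate  # tokens added per second (hypothetical value)
        self.tokens = float(capacity)   # start with a full bucket
        self.updated_at = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now
        # Spend one token per request if one is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerKeyRateLimiter:
    """Keeps one independent bucket per API key (illustrative only)."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._buckets = {}

    def allow(self, api_key, now=None):
        if now is None:
            now = time.monotonic()
        bucket = self._buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_rate, now)
            self._buckets[api_key] = bucket
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = PerKeyRateLimiter(capacity=3, refill_rate=1.0)
    # A burst of five requests from one key, all at the same instant.
    print([limiter.allow("key-a", now=0.0) for _ in range(5)])
    # A different key is unaffected by the first key's burst.
    print(limiter.allow("key-b", now=0.0))
```

Because each key has its own bucket, a single abused key exhausts only its own allowance, and the check can run before heavier work such as credit accounting.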