7digital experienced a critical incident on March 22, 2020 affecting the Downloading and Catalogue API and one other component, lasting 52 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Mar 22, 2020, 11:07 PM UTC
Dear clients, From 22:45 GMT we have been experiencing a severe platform outage affecting all areas of the 7digital API. Our on-call support engineers are currently investigating the issue and taking action to further stabilise the platform. You can subscribe to updates via email, webhooks and RSS feed on our status page (https://status.7digital.com/). If you would like to receive SMS updates, please create a Service Desk ticket with the Client Success Team. Once we have further updates we'll share them with additional announcements. If you have any questions, please create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team
- monitoring Mar 22, 2020, 11:41 PM UTC
Dear clients, The platform has now been restored and the error rate has dropped since 23:17 GMT. Our Tech team will continue to monitor the platform before we close this incident. If you continue to experience issues with the 7digital platform, please create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team
- resolved Mar 22, 2020, 11:59 PM UTC
Dear clients, The platform outage experienced today is now resolved. The platform is back to normal and the error rate has completely dropped since 23:25 GMT. Our Tech team will continue to investigate and an incident report will be shared in due course. If you continue to experience issues with the 7digital platform, please create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team
- postmortem Mar 24, 2020, 02:20 PM UTC
**7digital Incident Report**

### **Incident Details:**

**Incident Summary**

A switch within the CTR data centre power cycled itself, causing the ILB high-availability cluster to fail over. Whilst the ILB failover completed, the automatic failback (after the switch recovered) left the ILB and XRP in a state of limbo, which was only resolved when keepalived was restarted on all nodes (a rough sketch of that kind of check follows the timeline below). This caused an almost complete API outage, since most critical APIs rely on the ILB to route API calls. In addition, the cloud catalogue API did not recover as quickly as the data centre services because it uses a DNS entry whose automatic failover had been disabled.

### **Timeline**

- 22:42 - On-call SRE receives multiple Pingdom down alerts across all APIs.
- 22:45 - SRE online; reports that the VPN used to access the platform is up. Identifies the severity of the outage and calls Client Success OOH.
- 22:52 - API health dashboard shows a 100% error rate on most endpoints and large response times (> 2 seconds). Core Platform errors dashboard shows an initial load of API Router errors indicating they are timing out whilst connecting to the DB.
- 22:56 - Client Success indicate they are working on a notification to clients.
- 22:56 - SRE starts working through the "data centre failure modes" runbook.
- 22:59 - DC cross connect identified as being up.
- 23:02 - ILB IP announcements look OK according to "ip a". SRE notices that the backup ILB briefly received some traffic recently. SRE decides to restart keepalived to force re-announcement of the IPs anyway.
- 23:05 - Most services start to recover; Pingdom alerts clear; Core Platform application errors mostly clear apart from webstore & comparison-reproxy.
- 23:07 - SRE asked by Client Success whether the prepared platform announcement should go out, since the platform looks to be recovering. Decision taken to send it, as stability is not yet clear. Notification sent to clients.
- 23:07 - SRE notices that VHC has taken all API traffic and Pingdom is still reporting the CTR XRP as down. ~/track/details is also reported to be down.
- 23:11 - API origin DNS (which the Pingdom check uses) is found to be pointing to CTR. DNS Made Easy shows the record's auto-failover mode has been disabled. SRE re-enables the auto-failover. Pingdom alerts recover for all but the CTR XRP check.
- 23:15 - SRE notices that the release details endpoint still has a high error rate (50%) and the 7digital D2C webstore is still erroring in Core Platform application errors. Client Success manually checks ~/release/details and finds that it looks stable.
- 23:31 - Whilst investigating the issue with ~/release/details, SRE notices that errors to that endpoint dropped off from 23:26.
- 23:36 - SRE tells Client Success that the platform looks all up now, but they are still monitoring and looking into the loss of DC redundancy (CTR not handling API traffic).
- 23:39 - SRE forces 99% of API traffic to VHC whilst investigating the CTR issues.
- 23:41 - Client Success update the incident status to "monitoring".
- 23:43 - SRE restarts NGINX on CTR XLB 00 and 01, which has no effect. Restarting keepalived on those hosts, however, restores CTR's XRP service. The Pingdom alert clears.
- 23:59 - Incident closed.

**Duration of outage/incident (Time to Recovery):** 25 minutes
**Time taken to isolate/diagnose the issue (Time to Isolate):** 25 minutes
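The recovery steps at 23:02 and 23:43 both came down to restarting keepalived so that the load balancers re-announce their floating IPs. Purely as an illustration of that kind of check, the sketch below verifies which node currently holds a floating IP and forces a re-announcement when none (or more than one) does. The hostnames, the floating IP, and the use of SSH with systemctl are assumptions for illustration, not 7digital's actual runbook.

```python
# Hypothetical sketch: check which load-balancer node holds a floating IP and,
# if the VRRP state looks wrong, restart keepalived to force re-announcement.
# Hostnames, the VIP and the SSH/systemctl mechanism are placeholder assumptions.
import subprocess

FLOATING_IP = "10.0.0.100"                # placeholder virtual IP
LB_NODES = ["ctr-xlb-00", "ctr-xlb-01"]   # placeholder load-balancer hosts

def holds_ip(node: str, ip: str) -> bool:
    """Return True if `ip` is configured on any interface of `node`."""
    out = subprocess.run(
        ["ssh", node, "ip", "-o", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Simple substring check; match "10.0.0.100/" so 10.0.0.10 does not false-positive.
    return f"{ip}/" in out

def restart_keepalived(node: str) -> None:
    """Restart keepalived on `node` so it re-elects a MASTER and re-announces VIPs."""
    subprocess.run(["ssh", node, "sudo", "systemctl", "restart", "keepalived"], check=True)

if __name__ == "__main__":
    owners = [n for n in LB_NODES if holds_ip(n, FLOATING_IP)]
    if not owners:
        print(f"No node announces {FLOATING_IP}; restarting keepalived on all nodes")
        for node in LB_NODES:
            restart_keepalived(node)
    elif len(owners) > 1:
        print(f"Split ownership {owners} for {FLOATING_IP}; restarting keepalived")
        for node in owners:
            restart_keepalived(node)
    else:
        print(f"{owners[0]} holds {FLOATING_IP}; VRRP state looks healthy")
```

During the incident this kind of recovery was performed manually; the action items below focus on automating it so that SRE intervention is no longer required.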
### **Impact**

**What applications or services were affected?** Any partner services and internal applications (inc. web store) which use our API.

**How might these services have been affected?** Indicators show a complete outage during this time: error rates of 100% and high response times.

### **Technical Details**

It seems that whilst the platform correctly failed over given a presumed network blip, the failback did not complete cleanly once the blip had resolved itself. This caused the almost-total outage of the API. A smaller, secondary problem with how DNS is updated following the loss of a DC's XRP caused the cloud catalogue service to fail. This mainly affected 7digital's webstore service and did not impact partners.

**Dashboards:** Core Platform Application Errors; API Health dashboard; Data Centre Usage.

**Analysis of our response to rectifying the incident**

As is the case with many networking-triggered incidents, the information available to SREs was at first confusing and did not immediately reveal a resolution. However, since the team had witnessed something similar in the past, we had a runbook at hand to help SREs diagnose networking & data centre issues. This proved a decisive factor in the relatively quick recovery of the service, given the complex nature of the fault. Process-wise, we were quick to identify the impact to customers, and Client Success was able to notify partners as quickly as possible. We have also identified that we could have better documentation on how the cloud catalogue service is architected, so that the SRE team can better understand and recover the service.

### **Analysis of the technical issue/s**

Ideally, switch power cycles or failures should be something our infrastructure recovers from, or fails over around, automatically. In this instance, the infrastructure did not recover on its own and required SRE intervention to force all load balancers to re-announce their floating IPs to the switches. Our investigation will focus on how we can automate the recovery of the service in this scenario, as we've had similar occurrences in the past. We're also aware of CTR-TEN-AS1 being a SPOF in relation to the dark fibre, so we will look into ways of increasing redundancy there.

With regards to the ongoing webstore issues, it is presumed that the reason they continued to fail was a reliance on a DNS entry that had its automatic failover disabled. Since DNS Made Easy provides no audit trail, we will look at ways of regularly snapshotting the configuration to source control so we can trace changes in future (a rough sketch of this idea appears at the end of this report). It is presumed that once the TTL for the bad DNS record had expired, the cloud catalogue infrastructure recovered on its own, hence no intervention was required to fix that issue following the re-enabling of the automatic failover.

**Conclusions and Actions**

The resultant de-briefing identified the following issues with our process:

1. There are multiple locations which explain our incident response process. We should remove redundant copies of the process so that only one is accessible, to avoid confusion.

In general we were fairly happy with how quickly we responded. However, as this is the second time the load balancer has not recovered following a quick failback, we will prioritise looking at automating this so that we do not need manual SRE intervention in future. We will also look at updating our documentation on how the new cloud catalogue flow works for the SRE team.
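One of the follow-ups above is regularly snapshotting DNS configuration into source control, since DNS Made Easy provides no audit trail. The sketch below is a minimal illustration of that idea under stated assumptions: it records resolved answers for placeholder record names into a git checkout, rather than exporting the provider's full configuration via its API.

```python
# Hypothetical sketch: periodically snapshot resolved DNS answers for key records
# into a git repository so that changes (e.g. a record left pointing at a single
# data centre) leave a traceable history. Record names, the snapshot directory and
# the use of plain resolution instead of the DNS provider's API are assumptions.
import json
import socket
import subprocess
from datetime import datetime, timezone
from pathlib import Path

RECORDS = ["api.example.com", "origin.example.com"]   # placeholder record names
SNAPSHOT_DIR = Path("dns-snapshots")                   # assumed to be a git checkout

def resolve(name: str) -> list[str]:
    """Return the sorted list of A records currently answered for `name`."""
    _, _, addrs = socket.gethostbyname_ex(name)
    return sorted(addrs)

def snapshot() -> None:
    """Write the current answers to a JSON file and commit it if anything changed."""
    state = {name: resolve(name) for name in RECORDS}
    path = SNAPSHOT_DIR / "records.json"
    path.write_text(json.dumps(state, indent=2) + "\n")
    subprocess.run(["git", "-C", str(SNAPSHOT_DIR), "add", path.name], check=True)
    # `git diff --cached --quiet` exits non-zero when there are staged changes.
    diff = subprocess.run(["git", "-C", str(SNAPSHOT_DIR), "diff", "--cached", "--quiet"])
    if diff.returncode != 0:
        message = f"DNS snapshot {datetime.now(timezone.utc).isoformat()}"
        subprocess.run(["git", "-C", str(SNAPSHOT_DIR), "commit", "-m", message], check=True)

if __name__ == "__main__":
    snapshot()
```

Committing only when the snapshot differs keeps the history small while still showing when a record, such as the API origin, started resolving somewhere unexpected.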