ServiceChannel experienced a critical incident on December 12, 2023 affecting SFTP, lasting 4h 52m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Dec 12, 2023, 03:37 PM UTC
ServiceChannel is currently investigating an issue that prevents users from connecting to our SFTP servers from the internet. We are working to restore service as soon as possible. Thank you for your patience.
- monitoring Dec 12, 2023, 07:07 PM UTC
A fix has been implemented. We are monitoring the results.
- resolved Dec 12, 2023, 08:30 PM UTC
This incident has been resolved.
- postmortem Dec 14, 2023, 11:36 PM UTC
**Incident Report: SFTP Service Disruption**

**Date of Incident:** 12/12/2023
**Time/Date Incident Started:** 12/11/2023, 05:43 pm EST
**Time/Date Stability Restored:** 12/12/2023, 01:24 pm EST
**Time/Date Incident Resolved:** 12/12/2023, 01:54 pm EST
**Users Impacted:** Few
**Frequency:** Continuous
**Impact:** Major

**Incident description:** On December 11th at 5:43 pm EST, an unexpected disruption occurred in the Production ServiceChannel SFTP service. By the morning of December 12th, 2023, the ServiceChannel Support team began to receive customer reports of timeout errors when attempting to connect to the ServiceChannel SFTP server.

**Root Cause Analysis:** An investigation by the Site Reliability Engineering (SRE) team found no resource contention on the affected server instance. To rule out a hardware bottleneck, the SRE team nevertheless scaled the server instance up to the next larger instance size. After the scale-up, tests still showed external connections to port 22 failing, while all internal network tests succeeded. The SRE team then looked for network irregularities and found that the security policy governing the SFTP server had been altered to exclude access to port 22. Further investigation with the Security team determined that this change was part of a broad initiative to harden our platform's security posture. Regrettably, the policy update was executed without the normal change management process, and the broader engineering organization was not notified in advance. The network modification was subsequently reversed, and SFTP functionality was restored.

**Actions Taken:**

1. The SRE team inspected the SFTP server and confirmed it was operating within defined parameters. The team also proactively scaled up the infrastructure to address the possibility of any system bottlenecks.
2. The SRE team identified a suspected change in the security policy, in which port 22 access had been removed for all but private network address spaces. System event logs confirmed that this change was implemented by the Security team. Once the issue was identified, the Security team was informed, and an emergency rollback was requested.

**Mitigation Measures:** In light of this incident, the following preventative measures have been put in place:

1. Improvements to internal communications, including ensuring that all network changes are announced to and approved by the wider engineering organization prior to their implementation.
2. Going forward, infrastructure changes to the ServiceChannel Platform will be made by the SRE team using the normal Infrastructure as Code process.
3. Additional monitoring of SFTP infrastructure, using both network ping tests and end-to-end synthetic transaction tests, has been implemented to test from both internal and external network paths.
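
As an illustration of the external-path check described in the last mitigation measure, the following is a minimal sketch of a port 22 reachability probe. It is not ServiceChannel's actual monitoring. The function name `probe_ssh_banner` and the host `sftp.example.com` are hypothetical. The probe opens a TCP connection and reads the server's SSH identification string, so a firewall rule that silently drops external traffic shows up as a timeout rather than going unnoticed.

```python
import socket


def probe_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> dict:
    """Open a TCP connection and look for an SSH identification string.

    Hypothetical helper for illustration only. Returns a dict describing
    whether the port answered with a line starting with "SSH-".
    """
    result = {"host": host, "port": port, "reachable": False,
              "banner": None, "error": None}
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
        text = data.decode("ascii", errors="replace")
        # A server may send other lines before its identification string,
        # so scan every received line for the "SSH-" prefix.
        for line in text.splitlines():
            if line.startswith("SSH-"):
                result["banner"] = line.strip()
                result["reachable"] = True
                break
        if not result["reachable"]:
            result["error"] = "connected, but no SSH banner received"
    except OSError as exc:
        # Covers timeouts (dropped packets) and refused connections.
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


if __name__ == "__main__":
    # Placeholder hostname; replace with the endpoint being monitored.
    print(probe_ssh_banner("sftp.example.com", timeout=3.0))
```

Running the same probe from a host inside the private network and from one on the public internet would distinguish a server-side failure (both fail) from a network policy change like the one in this incident (only the external probe fails).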