Elevio incident

Intermittent downtime for Hosted Knowledge Base

Major Resolved

Elevio experienced a major incident on November 27, 2022. The incident has been resolved; the full update timeline is below.

Started: Nov 27, 2022, 10:49 PM UTC
Resolved: Nov 26, 2022, 07:00 PM UTC
Duration: not reported
Detected by Pingoru: Nov 27, 2022, 10:49 PM UTC

Update timeline

  1. resolved Nov 27, 2022, 10:49 PM UTC

    On Saturday 27th November there was an incident affecting the Elevio hosted KB, which caused it to become intermittently unavailable between 19:10 and 23:30 UTC. The issue was caused by a resource leak in the KB proxy server: file descriptors (fds) were not closed after reading a certificate, so the server rejected new connections once the fd limit was reached. The leak stemmed from a bug in the underlying proxy server software and was resolved by updating that software to the latest version. We apologise for this disruption in our service. The issue was particularly difficult to track down because there had been no recent updates to the proxy server, and the unfortunate timing of the event (Saturday night / Sunday early morning) meant extra delays in getting the daytime team up to speed.

  2. postmortem Nov 27, 2022, 10:50 PM UTC

    Timeline (in UTC):
    19:10: The hosted KB becomes unavailable and service restores without intervention in less than 1 minute.
    19:16: The hosted KB becomes unavailable and service restores without intervention within 2 minutes.
    19:50 - 20:30: Multiple events where the hosted KB becomes unavailable for 1-2 minutes and service restores without intervention. The on-call engineer escalates the incident to the backend team.
    20:30 - 22:00: The service becomes unavailable and no longer recovers on its own. Restarting the servers restores service for short periods of time.
    22:15: The issue is identified and a temporary fix is deployed. Service resumes as normal.
    23:30: A permanent fix is deployed. To deploy the new version, the temporary fix had to be momentarily reverted, which made the service unavailable for less than 5 minutes.
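The resource leak described in the resolved update can be illustrated with a minimal sketch. This is not the proxy server's code (the vendor does not name the software or show the bug); it only shows the general pattern of reading a certificate without closing its file descriptor, next to a version that always closes it. All function names and the temporary "certificate" file are hypothetical.

```python
import os
import tempfile


def read_cert_leaky(path, leaked_fds):
    """Read a certificate but never close the descriptor (the bug pattern)."""
    fd = os.open(path, os.O_RDONLY)
    # Recorded only so this demo can inspect and clean up the leak afterwards.
    leaked_fds.append(fd)
    return os.read(fd, 65536)


def read_cert_safe(path):
    """Read a certificate and always release the descriptor."""
    with open(path, "rb") as f:
        return f.read()


def is_open(fd):
    """Return True if `fd` still refers to an open descriptor in this process."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def demo(requests=50):
    # Hypothetical stand-in for the certificate file read on each connection.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"dummy certificate bytes\n")
        cert_path = tmp.name

    leaked = []
    try:
        # Each "connection" leaves one more descriptor open.
        for _ in range(requests):
            read_cert_leaky(cert_path, leaked)
        still_open = sum(is_open(fd) for fd in leaked)

        # The safe variant can run any number of times without accumulating fds.
        for _ in range(requests):
            read_cert_safe(cert_path)
    finally:
        for fd in leaked:
            os.close(fd)
        os.unlink(cert_path)

    after_cleanup = sum(is_open(fd) for fd in leaked)
    return still_open, after_cleanup
```

In the leaky variant the number of open descriptors grows with every call. In a long-running process that growth eventually reaches the per-process fd limit, at which point opening files or accepting sockets fails with "Too many open files". That matches the symptom in the update: connections are rejected, and a restart helps only until the limit is reached again.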