Cadmium experienced a critical incident on April 18, 2024, that affected EthosCE and lasted 17h 33m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Apr 18, 2024, 06:30 PM UTC
Our team is aware there is a service disruption and is working swiftly to identify the root cause.
- monitoring Apr 18, 2024, 06:40 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Apr 19, 2024, 12:04 PM UTC
This incident has been resolved.
- postmortem Apr 29, 2024, 08:57 PM UTC
During this outage, the router pods that direct traffic from the internet to customer EthosCE sites failed to route traffic normally. The system attempted to restart the router pods automatically, but the restarts did not succeed, and the system eventually “backed off” from further attempts in order to avoid a restart loop. An engineer manually deleted the router pods, the pods respawned, and the system returned to normal operation.
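For readers unfamiliar with the "back off" behavior mentioned above, the following is a minimal sketch of a capped exponential restart delay. The function name and the timing values (a 10-second base doubling up to a 300-second cap) are illustrative assumptions, not details taken from this incident or from the platform's actual configuration.

```python
def restart_delays(base_seconds=10, cap_seconds=300, attempts=8):
    """Return the wait (in seconds) before each restart attempt.

    The delay doubles after every failed attempt and is clamped to
    cap_seconds. All values here are hypothetical, for illustration only.
    """
    delays = []
    delay = base_seconds
    for _ in range(attempts):
        delays.append(min(delay, cap_seconds))
        delay *= 2
    return delays


if __name__ == "__main__":
    # Each failed restart waits longer than the last, up to the cap.
    print(restart_delays())
```

A scheme like this spaces out retries but does not address whatever is causing the restarts to fail, which is consistent with manual intervention being needed here.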