Hex incident

Degraded kernel acquisition

Hex experienced a critical incident on May 13, 2025 affecting Kernels, lasting 4h 32m. The incident has been resolved; the full update timeline is below.

Started: May 13, 2025, 02:38 PM UTC
Resolved: May 13, 2025, 07:10 PM UTC
Duration: 4h 32m
Detected by Pingoru: May 13, 2025, 02:38 PM UTC

Affected components

Kernels

Update timeline

investigating May 13, 2025, 02:38 PM UTC

We are currently investigating this issue.

investigating May 13, 2025, 04:27 PM UTC

We are continuing to investigate this issue.

monitoring May 13, 2025, 04:58 PM UTC

A fix has been implemented and we are monitoring the results.

identified May 13, 2025, 05:35 PM UTC

Some users are still experiencing issues as we implement a fix.

monitoring May 13, 2025, 06:09 PM UTC

A fix has been implemented and we are monitoring the results.

resolved May 13, 2025, 07:10 PM UTC

This incident has been resolved.

postmortem May 21, 2025, 08:20 PM UTC

After fully analyzing the timeline of events, **we’ve confirmed that the root cause of the incident was an issue in the AWS EBS storage system backing our primary database replica**. The issue began on May 12 11:00 PM PDT according to AWS, and the effect was gradual degradation of performance on the replica, eventually leading to a cascade where even our primary database got backpressured on critical write operations, which triggered the instability causing the incident. The AWS issue was resolved May 13 9:32 AM PDT, after we pushed a database configuration change that caused the EBS system software to reset. We then stabilized and cleaned up systems on our end before resolving the incident.

While the root cause of this incident was on the AWS side, we are not satisfied that our detection and response were fast enough. We are investing in the following mechanisms to improve:

* We have automated testing to monitor provisioning of kernels, but it is combined with a larger testing suite that makes the signal less immediately actionable. We will be extracting this out as its own monitor so we can more quickly identify and react.
* Our database monitoring around latencies will be made more comprehensive, including the replica database, so we do not miss this early signal.
* We are improving our incident runbook to cover some of these checks earlier in the process so we zero in on the root cause more quickly.
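
The postmortem quotes the AWS times in PDT, while the status timeline above uses UTC. As a rough aid for lining the two up, the sketch below converts the PDT times to UTC and computes the gaps between them. It assumes PDT is a fixed UTC-7 offset. The variable names and the "lead time" and "cleanup time" labels are illustrative, not terms from the postmortem.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT is a fixed UTC-7 offset (avoids needing a tz database).
PDT = timezone(timedelta(hours=-7), "PDT")
UTC = timezone.utc

# Times quoted in the postmortem (PDT) and in the status timeline (UTC).
aws_issue_start = datetime(2025, 5, 12, 23, 0, tzinfo=PDT)
aws_issue_end = datetime(2025, 5, 13, 9, 32, tzinfo=PDT)
incident_start = datetime(2025, 5, 13, 14, 38, tzinfo=UTC)
incident_end = datetime(2025, 5, 13, 19, 10, tzinfo=UTC)


def fmt(delta: timedelta) -> str:
    """Format a non-negative timedelta as 'Xh YYm'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


aws_start_utc = aws_issue_start.astimezone(UTC)
aws_end_utc = aws_issue_end.astimezone(UTC)

# Illustrative labels: time the AWS issue ran before the incident opened,
# and time between the AWS fix and the incident being marked resolved.
lead_time = fmt(incident_start - aws_start_utc)
cleanup_time = fmt(incident_end - aws_end_utc)
duration = fmt(incident_end - incident_start)

print("AWS issue began (UTC):   ", aws_start_utc.strftime("%b %d, %I:%M %p"))
print("AWS issue resolved (UTC):", aws_end_utc.strftime("%b %d, %I:%M %p"))
print("Lead time before incident opened:", lead_time)
print("Cleanup time after AWS fix:      ", cleanup_time)
print("Incident duration:               ", duration)
```

Under these assumptions, the AWS issue began at 06:00 AM UTC on May 13 and was resolved at 04:32 PM UTC. That is shortly before the 04:58 PM UTC "monitoring" update, and it matches the 4h 32m duration reported above.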