Hex incident
Some users are experiencing issues accessing Hex projects
Hex experienced a critical incident on August 20, 2025 affecting the Main site, lasting 1h 26m. The incident has been resolved; the full update timeline is below.
Affected components
Main site
Update timeline
- investigating Aug 20, 2025, 05:12 PM UTC
We are currently investigating this issue.
- investigating Aug 20, 2025, 05:32 PM UTC
We are continuing to investigate this issue.
- investigating Aug 20, 2025, 05:50 PM UTC
We are continuing to investigate this issue.
- identified Aug 20, 2025, 06:25 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Aug 20, 2025, 06:58 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Aug 20, 2025, 07:50 PM UTC
This incident has been resolved.
- postmortem Aug 26, 2025, 01:15 PM UTC
After fully analyzing the timeline of events, **we've confirmed that the root cause was a cascading failure triggered by the combination of a feature flag, deployment timing during peak traffic, and an inefficiency in our database connection management**. The issue began August 19 at 4:03 PM PDT, when a feature flag was updated to have clients query the server more frequently to eliminate a rare display issue in the client. It escalated August 20 at 10:00 AM PDT when we did a production deployment during peak traffic to add necessary support for a new feature release. This combination created a thundering herd problem where client reconnection attempts exponentially increased system pressure, saturating our database connections and triggering system-wide failures. Service was restored for most users at 10:52 PM PDT after manual intervention, and then fully restored at 11:26 AM EDT for all users after resetting backend pods. While this incident has been resolved, we are not satisfied by this sequence of events, from both the risk management perspective and the detection / response perspective. We are investing in the following mechanisms to improve: * We have increased baseline server capacity and reverted the change to the feature flag that increased database load. We are also optimizing database connection usage to handle traffic spikes via better pooling and reuse. * Our monitoring around system load and client connection patterns will be made more comprehensive, including proactive alerts on key server-side load and performance metrics for request processing, so we can identify and react to issues before they impact service. * We are improving our client-server reconnection protocol during releases to include rate limiting and exponential backoff, and investigating further advancements so we can handle deployment rollovers more gracefully without service disruption. * Until the improvements above are in place, we will be avoiding deployments during peak traffic times, unless absolutely necessary to resolve major issues with service stability.
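To illustrate the pooling-and-reuse idea, here is a minimal sketch using node-postgres; this is not Hex's actual code, and the pool settings, the `countProjects` helper, and the `projects` table are hypothetical. The point is that a bounded pool makes a traffic spike queue behind a fixed number of connections instead of opening new ones until the database saturates.

```ts
import { Pool } from "pg"; // node-postgres, assumed here purely for illustration

// A bounded pool caps concurrent database connections, so a burst of requests
// waits for a free connection rather than exhausting the database's limit.
const pool = new Pool({
  max: 20,                        // hard cap on open connections per process
  idleTimeoutMillis: 30_000,      // keep idle connections around for reuse
  connectionTimeoutMillis: 2_000, // fail fast instead of piling up waiters
});

// Hypothetical query helper: pool.query() checks a connection out, runs the
// statement, and returns the connection, so bursts share the same small set.
export async function countProjects(): Promise<number> {
  const { rows } = await pool.query("SELECT count(*) AS n FROM projects");
  return Number(rows[0].n);
}
```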
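Likewise, here is a minimal TypeScript sketch of rate-limited reconnection with exponential backoff; the `reconnectWithBackoff` helper and its parameters are hypothetical, not Hex's actual protocol. Capping attempts bounds each client's retry rate, and randomized ("full jitter") delays spread a fleet of disconnected clients out over time instead of letting them retry in lockstep after a deployment rollover.

```ts
// Hypothetical reconnection helper, assuming `connect` resolves once a
// connection to the server succeeds and rejects otherwise.
async function reconnectWithBackoff(
  connect: () => Promise<void>,
  maxAttempts = 8,      // rate limit: give up after a bounded number of tries
  baseDelayMs = 500,
  maxDelayMs = 30_000,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect();
      return; // connected successfully
    } catch {
      // Exponential backoff: 500ms, 1s, 2s, ... capped at maxDelayMs.
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      // Full jitter: a uniform delay in [0, ceiling) desynchronizes clients
      // so they do not reconnect in lockstep (the thundering herd).
      const delay = Math.random() * ceiling;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error(`failed to reconnect after ${maxAttempts} attempts`);
}
```

Client-side backoff like this would complement, not replace, server-side rate limiting, since the server cannot rely on every client being well behaved.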