Writer experienced a major incident on March 9, 2026, lasting —. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Mar 09, 2026, 09:45 PM UTC
Envoy pod q7fjq tripped its default ext_authz circuit breaker (max_connections: 1,024) after absorbing 57% of traffic due to a rollout without readiness probes, then getting hit by a GKE node removal reconnection storm. The circuit breaker locked for 17 minutes, returning HTTP 500 for ~62% of auth requests on that pod. 9.4% of all user-facing requests failed (5,203 of 55,449). The system self-recovered when the auth connection latency tail cleared naturally. The other two envoy pods were completely healthy throughout. A skynet-frontend restart at 17:41:15 was coincidental (old pods still alive when CB cleared at 17:42:00).