Fasterize incident

Performance degradation

Minor Resolved

Fasterize experienced a minor incident on October 19, 2023 affecting Acceleration, lasting 15h 17m. The incident has been resolved; the full update timeline is below.

Started
Oct 19, 2023, 04:04 PM UTC
Resolved
Oct 20, 2023, 07:22 AM UTC
Duration
15h 17m
Detected by Pingoru
Oct 19, 2023, 04:04 PM UTC

Affected components

Acceleration

Update timeline

  1. investigating Oct 19, 2023, 04:04 PM UTC

    We currently have some issues on our European infrastructure, and a fix is in progress. There is a slight impact on acceleration: some pages may be slower, and some optimizations are disabled.

  2. identified Oct 19, 2023, 04:35 PM UTC

    We have mitigated the issue. Performance is back to normal. We are still investigating the root cause.

  3. monitoring Oct 19, 2023, 04:49 PM UTC

    We're monitoring the results, but everything looks fine. The issue seems to be related to a schema change in a storage component (to be confirmed after the RCA).

  4. resolved Oct 20, 2023, 07:22 AM UTC

    This incident was resolved at 6:25 PM Paris time. A post mortem will follow.

  5. postmortem Oct 23, 2023, 09:21 PM UTC

    # Description

    On Thursday, October 19th, between 4:55 PM UTC+2 and 6:25 PM UTC+2, the Fasterize European platform was unable to optimize web pages for all customers. The original, unoptimized version of each page was delivered instead.

    We discovered that between 4:45 PM UTC+2 and 5:50 PM UTC+2, a specific request caused a failure in the Fasterize engine during optimization and left the process in a non-functional state. The number of functional processes then decreased until it fell below a critical threshold. Our engine then automatically switched to a degraded mode in which pages were no longer optimized and were served without delay.

    At 5:29 PM UTC+2, the on-call team manually added capacity to the platform to return to a stable state, but this did not definitively improve the situation.

    Starting from 6:15 PM UTC+2, the optimization processes gradually resumed handling traffic. The engine then returned to its normal mode of operation.

    To prevent any further incidents, the request has been excluded from optimizations and a fix to the optimization engine is being developed.

    ## Action plan

    **Short term:**

    * Fix the engine so that it can optimize the responsible request without crashing

    **Medium term:**

    * Review the health check system at the engine level to automatically restart non-functional processes
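The failure mode described in the postmortem (functional processes dropping below a critical threshold, a switch to degraded mode, and a health check that restarts non-functional processes) can be illustrated with a small sketch. This is not Fasterize's implementation: the `Worker` and `Pool` classes, the `min_healthy` threshold, and the in-memory "restart" are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    """Hypothetical stand-in for one optimization process."""
    name: str
    healthy: bool = True
    restarts: int = 0


@dataclass
class Pool:
    """Hypothetical pool of optimization processes."""
    workers: List[Worker]
    min_healthy: int  # assumed critical threshold, not a real Fasterize setting

    def healthy_count(self) -> int:
        return sum(1 for w in self.workers if w.healthy)

    def mode(self) -> str:
        # Below the threshold, pages are served unoptimized ("degraded").
        if self.healthy_count() >= self.min_healthy:
            return "normal"
        return "degraded"

    def health_check(self) -> List[str]:
        """Restart every non-functional worker and return their names."""
        restarted = []
        for w in self.workers:
            if not w.healthy:
                # Stands in for killing and respawning a real process.
                w.healthy = True
                w.restarts += 1
                restarted.append(w.name)
        return restarted


if __name__ == "__main__":
    pool = Pool([Worker(f"w{i}") for i in range(4)], min_healthy=3)
    # A problematic request leaves two workers non-functional.
    pool.workers[0].healthy = False
    pool.workers[1].healthy = False
    print(pool.mode())
    print(pool.health_check())
    print(pool.mode())
```

In this sketch, adding capacity alone would only help until the new workers also hit the problematic request, which is one plausible reading of why the manual scale-up did not hold. Restarting non-functional workers and excluding the request address the cause more directly.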