Scalyr incident

Graph, alert and dashboard irregularities for some US customers

Minor Resolved

Scalyr experienced a minor incident on August 29, 2022 affecting Main Site, lasting about 1 day. The incident has been resolved; the full update timeline is below.

Started
Aug 29, 2022, 01:47 PM UTC
Resolved
Aug 30, 2022, 02:16 PM UTC
Duration
1d
Detected by Pingoru
Aug 29, 2022, 01:47 PM UTC

Affected components

Main Site

Update timeline

  1. investigating Aug 29, 2022, 01:47 PM UTC

    Some customers in the US cluster are reporting unexpected behavior including false alarms, inconsistent graphs, and missing search results. We are currently investigating the issue.

  2. identified Aug 29, 2022, 05:40 PM UTC

    We have identified the issue in our time series database and are working on remediation.

  3. identified Aug 29, 2022, 09:38 PM UTC

    At 15:00 PDT (22:00 UTC) we will restart our summary service, which powers our alerts and speeds up dashboard rendering. The summary service will be unavailable for approximately 10 minutes, after which it will begin rebuilding time series data for all alerts. An alert will not be evaluated until its time series is rebuilt, so alerts with longer lookback periods will be the last to be evaluated successfully. Loading a dashboard will trigger the time series on that dashboard to be recreated, so the dashboard will load more slowly at first.

  4. monitoring Aug 29, 2022, 11:24 PM UTC

    The summary service has been restarted and we are beginning to rebuild time series data for all alerts. Once the time series for an alert has been recreated, we will begin evaluating it again. Alerts with longer lookback periods will be the last to be evaluated successfully. Loading a dashboard will trigger the time series on that dashboard to be recreated, so the dashboard will load more slowly at first. We are continuing to monitor this process.

  5. resolved Aug 30, 2022, 02:16 PM UTC

    A majority of the time series have been rebuilt. Dashboard performance has been restored for most customers, and alert evaluation success rates are at pre-incident levels. We're marking the issue resolved and expect a full return to pre-incident levels within the next 24 hours.
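
Updates 3 and 4 note that an alert cannot be evaluated until its time series is rebuilt, and that alerts with longer lookback periods recover last. The sketch below is one rough way to picture that ordering, not a description of Scalyr's actual summary service. The `Alert` class, the `rebuild_ready_times` function, the alert names, and the backfill rate are all hypothetical, and the model assumes every series backfills in parallel at the same constant rate.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str              # hypothetical alert name
    lookback_minutes: int  # history the alert needs before it can be evaluated


def rebuild_ready_times(alerts, backfill_rate=60.0):
    """Map each alert name to the minutes after restart at which it is ready.

    Assumption: every series is rebuilt in parallel at `backfill_rate`
    minutes of history per wall-clock minute (an invented figure).
    """
    return {a.name: a.lookback_minutes / backfill_rate for a in alerts}


alerts = [
    Alert("daily-volume-24h", 24 * 60),
    Alert("error-rate-5m", 5),
    Alert("latency-1h", 60),
]

ready = rebuild_ready_times(alerts)

# Alerts become evaluable in order of increasing lookback period.
order = sorted(ready, key=ready.get)
for name in order:
    print(f"{name}: ready after {ready[name]:.2f} min")
```

Under these assumptions the 5-minute alert is evaluable almost immediately, while the 24-hour alert waits until its full window has been rebuilt, which matches the recovery order described in the updates.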