Redox incident

Dashboard logs are delayed at this time. Message processing is not impacted.

Major · Resolved

Redox experienced a major incident on August 26, 2024 affecting Dashboard Tools, lasting 1d 8h. The incident has been resolved; the full update timeline is below.

Started
Aug 26, 2024, 01:15 PM UTC
Resolved
Aug 27, 2024, 10:03 PM UTC
Duration
1d 8h
Detected by Pingoru
Aug 26, 2024, 01:15 PM UTC

Affected components

Dashboard Tools

Update timeline

  1. investigating Aug 26, 2024, 01:15 PM UTC

    We are currently investigating this issue.

  2. investigating Aug 26, 2024, 03:45 PM UTC

    We are continuing to investigate the issue. Log visibility in the dashboard is unavailable while we work to resolve this.

  3. investigating Aug 26, 2024, 04:18 PM UTC

    Logs are available in the dashboard again. Visibility remains delayed.

  4. monitoring Aug 26, 2024, 08:23 PM UTC

    A fix has been implemented and we are catching up on traffic. We expect to return to regular observability of logs by this evening (8/26).

  5. monitoring Aug 26, 2024, 09:50 PM UTC

    Logs are continuing to catch up. We expect to return to regular observability of logs by 2:00 AM CT tomorrow (8/27).

  6. monitoring Aug 27, 2024, 01:16 PM UTC

    Log observability has returned to expected performance. We will continue to monitor performance throughout the day (8/27).

  7. resolved Aug 27, 2024, 10:03 PM UTC

    Log observability has stabilized and remained performant. This incident has been resolved.

  8. postmortem Sep 05, 2024, 01:17 PM UTC

    # Logs intermittently unavailable to view or search

    ## Summary

    From August 25-26, 2024, logs were intermittently delayed or unavailable to view or search in the Redox dashboard. Message processing was unaffected.

    ## What Happened & How We Responded

    * On the morning of August 25, AWS initiated an automated failover of their managed database service due to an underlying storage volume issue, which subsequently affected throughput of our logs processing.
    * On August 25 at 05:39 CT, we restarted impacted application processes, which resolved the immediate issue.
    * On August 26 at 07:58 CT, we were alerted that logs were again falling behind in processing time. Working with AWS support, we uncovered that the underlying storage from the previous day's failover was still being optimized, resulting in database write latency. The storage optimization completed at 15:00 CT, and the service was fully available again at 22:29 CT.

    ## What We Are Doing About This

    * We are exploring an underlying storage system change to further increase our infrastructure durability.