Coveralls incident

Elevated Latency

Severity: Minor, Status: Resolved

Coveralls experienced a minor incident on April 24, 2025 affecting Coveralls.io Web and Coveralls.io API, lasting 3h 31m. The incident has been resolved; the full update timeline is below.

Started: Apr 24, 2025, 07:35 PM UTC
Resolved: Apr 24, 2025, 11:06 PM UTC
Duration: 3h 31m
Detected by Pingoru: Apr 24, 2025, 07:35 PM UTC
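
The reported duration can be checked from the Started and Resolved timestamps above. The following is a minimal sketch, not part of the vendor's tooling: the helper names `parse_utc` and `format_duration` are hypothetical, and it assumes the timestamps follow the exact "Apr 24, 2025, 07:35 PM UTC" layout used on this page.

```python
from datetime import datetime, timezone

# Layout of the timestamps on this page, without the trailing " UTC" label.
FMT = "%b %d, %Y, %I:%M %p"


def parse_utc(stamp: str) -> datetime:
    """Parse a status-page timestamp such as 'Apr 24, 2025, 07:35 PM UTC'."""
    # Drop the " UTC" label, parse the rest, then attach the UTC timezone.
    naive = datetime.strptime(stamp.replace(" UTC", ""), FMT)
    return naive.replace(tzinfo=timezone.utc)


def format_duration(start: datetime, end: datetime) -> str:
    """Render the gap between two datetimes as '<hours>h <minutes>m'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


started = parse_utc("Apr 24, 2025, 07:35 PM UTC")
resolved = parse_utc("Apr 24, 2025, 11:06 PM UTC")
duration = format_duration(started, resolved)
print(duration)  # prints "3h 31m"
```

The result matches the Duration field, which is measured from the Started time and not from the first entry in the update timeline.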

Affected components

Coveralls.io Web, Coveralls.io API

Update timeline

  1. investigating Apr 24, 2025, 03:35 PM UTC

    We are investigating elevated latency in our background jobs system. Some users have also reported receiving Timeout errors while trying to load web pages.

  2. identified Apr 24, 2025, 03:54 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Apr 24, 2025, 04:15 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. monitoring Apr 24, 2025, 04:46 PM UTC

    We are continuing to monitor for any further issues.

  5. monitoring Apr 24, 2025, 05:25 PM UTC

    We believe the issue is resolved. We are scaling infrastructure to clear any delayed background jobs, and monitoring to ensure latency stays within normal range.

  6. monitoring Apr 24, 2025, 06:01 PM UTC

    We are monitoring for further issues. Performance for new builds is normal. We are waiting for dequeue metrics to fall below 50% normal before we lift the "degraded performance" rating.

  7. monitoring Apr 24, 2025, 07:35 PM UTC

    Performance has been restored to standard and the site is fully operational. We will continue to clear any previously blocked (or retry) jobs we discover in the background job queues, and to monitor performance stats as they clear.

  8. monitoring Apr 24, 2025, 08:57 PM UTC

    The site remains fully operational, and performance for all new builds is normal. We’re continuing to monitor request and query times closely to identify any long-running queries that may have contributed to recent job processing delays or latency spikes.

  9. resolved Apr 24, 2025, 11:06 PM UTC

    We’re now closing this incident, several hours after restoring full system stability. Over the past 4 hours, we’ve continued to monitor key requests and queries closely. During that time, we identified a number of previously long-running queries that we’ve either:

    - Optimized immediately, based on new platform characteristics; or
    - Added to a short-term optimization backlog for tuning over the next few days.

    These efforts are part of our ongoing work to adapt all app queries to the updated infrastructure context.