Coveralls incident

Delayed Coverage Calculations for Some Users

Major · Resolved

Coveralls experienced a major incident on June 2, 2025 affecting Coveralls.io Web and Coveralls.io API, lasting 18h 56m. The incident has been resolved; the full update timeline is below.

Started
Jun 02, 2025, 03:55 PM UTC
Resolved
Jun 03, 2025, 10:52 AM UTC
Duration
18h 56m
Detected by Pingoru
Jun 02, 2025, 03:55 PM UTC

Affected components

Coveralls.io Web, Coveralls.io API

Update timeline

  1. investigating Jun 02, 2025, 01:42 PM UTC

    We are currently investigating this issue.

  2. identified Jun 02, 2025, 02:18 PM UTC

    The issue has been identified and a fix is being implemented.

  3. identified Jun 02, 2025, 02:20 PM UTC

    We need to pause processing momentarily to clear a backlog of DB connections. We cut over to a new database version this weekend, and even after months of planning and preventative steps, planner regressions are still common during periods of elevated usage following such a change. We will identify the offending SQL statements, fix their planner issues, and restart work as soon as possible. Thanks for your patience as we work through this as quickly as possible.
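
    As a sketch of the diagnosis loop described above: one common way to find planner-regressed statements in PostgreSQL is the pg_stat_statements extension plus EXPLAIN. The table, columns, and query below are illustrative assumptions, not Coveralls' actual schema.

    ```sql
    -- pg_stat_statements ships with PostgreSQL but must be enabled
    -- (shared_preload_libraries) and created in the database.
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- Surface the statements consuming the most total execution time.
    SELECT query, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;

    -- Inspect the plan of a suspect query; a "Seq Scan" on a large table
    -- where an index scan is expected signals a planner regression.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM jobs WHERE repo_id = 42 AND state = 'pending';
    ```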

  4. identified Jun 02, 2025, 02:52 PM UTC

    We are continuing to work on a fix for this issue.

  5. identified Jun 02, 2025, 03:06 PM UTC

    We are continuing to work on a fix for this issue.

  6. monitoring Jun 02, 2025, 03:34 PM UTC

    A fix has been implemented and we are monitoring the results.

  7. monitoring Jun 02, 2025, 03:55 PM UTC

    We’re currently experiencing an outage due to unexpected query planner behavior following our recent upgrade to PostgreSQL 16. Despite extensive preparation and testing, one of our core background queries began performing full table scans under the new version, causing a rapid increase in load and a job backlog.

    What we're doing:
    - We’ve paused background job processing to stabilize the system.
    - We tried all "quick fixes," such as adjustments to DB parameters that affect planner choices, to no effect.
    - We're now actively deploying a targeted database index to resolve the performance issue.
    - We’ve identified a longer-term fix that will make the query safer and more efficient on the new version of PostgreSQL.

    Why this happened: PostgreSQL 16 introduced changes to how certain types of queries are planned. A query that performed well in PostgreSQL 12 unexpectedly triggered a much more expensive plan in 16. We're correcting for that now.

    Estimated recovery: Background job processing is expected to resume within 20–40 minutes, with full service restoration shortly thereafter.

    We’ll continue to post updates here as we make progress. Thanks for your patience; we’re on it.
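
    The "targeted database index" mentioned above might look like the following sketch; the table, columns, and predicate are assumptions, since the real schema is not public. CREATE INDEX CONCURRENTLY builds the index without locking out writes, which matters while background processing is being resumed.

    ```sql
    -- Illustrative only: table and column names are assumptions.
    -- A partial index matching the hot query's predicate gives the
    -- PostgreSQL 16 planner a cheap path, so it picks an index scan
    -- instead of a full table scan.
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_jobs_pending_repo
        ON jobs (repo_id)
        WHERE state = 'pending';

    -- Re-check the plan afterwards; it should now show an Index Scan.
    EXPLAIN SELECT * FROM jobs WHERE repo_id = 42 AND state = 'pending';
    ```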

  8. monitoring Jun 02, 2025, 04:08 PM UTC

    We have completed implementation of our fix. We are cautiously resuming background processing and will continue monitoring closely. If you notice any delays in build processing, rest assured they will be resolved shortly. Thank you for your patience — more updates will follow as we return to full capacity.

  9. monitoring Jun 02, 2025, 04:51 PM UTC

    All systems operational. We are carefully scaling resources and monitoring database performance to ensure stable recovery. Some delays in build and coverage report processing may still be observed as we restore full capacity. Thank you for your continued patience — we’ll share further updates as recovery progresses.

  10. investigating Jun 02, 2025, 07:10 PM UTC

    While monitoring, we have discovered additional planner anomalies that are slowing down queries associated with our various calculation jobs. We are investigating these and working to identify and implement a fix. We will continue posting updates here.

  11. identified Jun 02, 2025, 08:48 PM UTC

    The issue has been identified and a fix is being implemented.

  12. identified Jun 02, 2025, 10:05 PM UTC

    We are continuing to work on a fix for this issue.

  13. monitoring Jun 02, 2025, 10:54 PM UTC

    A partial fix has been implemented and we are monitoring the results.

  14. monitoring Jun 03, 2025, 12:50 AM UTC

    We are continuing to monitor for any further issues.

  15. monitoring Jun 03, 2025, 03:05 AM UTC

    We have implemented another fix and are monitoring the results.

  16. resolved Jun 03, 2025, 10:52 AM UTC

    This incident has been resolved, but we will continue monitoring closely. All systems are operational; however, we will leave the systems category at Degraded Performance until we have fully cleared the backlog of background processing jobs.