Coveralls experienced a major incident on February 13, 2025, affecting Coveralls.io Web and Coveralls.io API and lasting 7h 41m. The incident has been resolved; the full update timeline is below.
Affected components
- Coveralls.io Web
- Coveralls.io API
Update timeline
- investigating Feb 12, 2025, 08:54 PM UTC
We are currently investigating reports of service disruptions for some users, possibly related to specific subscriptions or repos.
- identified Feb 12, 2025, 09:07 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Feb 12, 2025, 09:08 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring Feb 12, 2025, 09:20 PM UTC
While we have applied a fix and are monitoring for any further issues, we are clearing backlogged jobs for some accounts. If you are waiting on some recent builds to complete, please give them at least another 20 minutes to clear. If you are not seeing your builds clear after that, please reach out to us with your org/subscription and repo name(s) at [email protected].
- monitoring Feb 12, 2025, 09:43 PM UTC
We are deploying extra servers to help clear backed-up jobs. This will entail a rolling reboot, which may cause some users to lose their current connection to Coveralls.io. Your connection should be restored momentarily, so please try again in 30 seconds to 1 minute.
- monitoring Feb 12, 2025, 10:00 PM UTC
We are experiencing partial outages and are working to resolve them as soon as possible.
- monitoring Feb 12, 2025, 10:13 PM UTC
We are still experiencing partial outages as we try to deploy across an extended fleet of servers. We are all hands on deck and working to resolve this as soon as possible.
- monitoring Feb 12, 2025, 10:25 PM UTC
We have resolved the partial outages and are monitoring.
- monitoring Feb 13, 2025, 05:33 AM UTC
We are continuing to monitor for any further issues.
- monitoring Feb 13, 2025, 12:16 PM UTC
We are not clearing background jobs fast enough to recover by morning US PST, so we will be putting the site into read-only mode for about 2 hours (4:30-6:30 AM US PST) in order to perform some database operations.
- monitoring Feb 13, 2025, 01:00 PM UTC
We are in maintenance mode while we perform some database tasks to improve our performance in clearing background jobs still stuck in the queue. ETA: 2 hours, but we may need to update this as we monitor progress.
- monitoring Feb 13, 2025, 01:33 PM UTC
Use "fail on error" to keep Coveralls 4xx from failing your CI builds / holding up your PRs: While our API is in maintenance mode, new coverage report uploads (POSTs to /api/v1/jobs) will fail with a 405 or other 4xx error. To keep this from breaking your CI builds and holding up your PRs, allow coveralls steps to "fail on error." If you are using one of our Official Integrations, add: - `fail-on-error: false` if using Coveralls GitHub Action - `fail_on_error: false` if using Coveralls Orb for CircleCI - `--no-fail` flag if using Coveralls Coverage Reporter directly Documentation: - Official Integrations: https://docs.coveralls.io/integrations#official-integrations - Coveralls GitHub Action: https://github.com/marketplace/actions/coveralls-github-action - Coveralls Orb for CircleCI: https://circleci.com/developer/orbs/orb/coveralls/coveralls - Coveralls Coverage Reporter: https://github.com/coverallsapp/coverage-reporter Reach out to [email protected] if you need help.
- monitoring Feb 13, 2025, 05:15 PM UTC
Our ETA for reverting maintenance mode is within the next 30 minutes.
- monitoring Feb 13, 2025, 05:43 PM UTC
We are out of maintenance mode and monitoring live transactions.
- resolved Feb 13, 2025, 08:41 PM UTC
We are closing this issue but will continue to monitor as we clear the remaining queues of background jobs from yesterday. If you believe any of your recent builds are still affected (incomplete), or if you are having any issues uploading coverage reports, please reach out to us at [email protected].
- postmortem Feb 17, 2025, 05:14 PM UTC
### **Incident Postmortem: Database Partitioning Bottleneck & Job Backlog**

#### **Summary**

A database partitioning limitation caused a severe backlog of background jobs, leading to degraded build processing times from **Monday, February 10, to Friday, February 14, 2025**. The backlog resulted from excessive autovacuum contention on a high-growth table, which ultimately led to cascading failures in job processing, database performance, and monitoring visibility.

#### **Root Cause**

Several tables in our production database grow at an accelerated rate. While we employ a partitioning strategy to prevent them from becoming unwieldy, our time-based approach failed to transition a critical table before it reached an unmanageable size. During investigation, we found:

* The table was in a perpetual state of autovacuum due to an excessive number of dead tuples.
* The high volume of tuples prevented autovacuum from progressing beyond the scanning phase, causing tuple locks.
* These locks delayed regular transactions, leading to transaction backups that worsened over time.
* By late **Tuesday, February 11**, the backlog had reached a breaking point, causing tens of thousands (and eventually hundreds of thousands) of jobs to accumulate.
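As an illustration only (the references to autovacuum, dead tuples, and VACUUM FULL imply a PostgreSQL database, but the exact tooling here is an assumption, not our internal procedure), a query like the following against the standard `pg_stat_user_tables` view surfaces tables with runaway dead-tuple counts and stalled autovacuums:

```sql
-- Sketch: list the tables with the most dead tuples and check when
-- autovacuum last completed on each (assumes PostgreSQL).
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;
```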
#### **Impact**

1. **Background Job Delays**
   1. A significant job queue buildup occurred between **February 10 and February 11**.
   2. Clearing the backlog took an additional two days (**February 11–13**).
   3. Failed jobs in long-tail retries prolonged the impact for another **24–36 hours**.
2. **Monitoring Gaps & Alert Failures**
   1. Average job duration alerts triggered only after the queue size became a critical issue.
   2. As server load increased, monitoring metrics stopped logging, preventing alerts that could have provided earlier intervention signals.
3. **Database & Infrastructure Overload**
   1. Scaling up resources to clear the backlog introduced additional database contention due to high transaction volumes, exacerbating delays.
   2. The increased database load led to degraded server performance, disconnecting our orchestration layer and APM monitoring.
   3. This created a self-reinforcing failure loop that required **continuous manual intervention** from **February 12 to February 13**.

#### **Resolution**

* We transitioned the affected table, significantly relieving the bottleneck.
* We scaled up resources to process the backlog, though this required careful throttling to avoid further database contention.
* On **Thursday, February 13**, we placed the site into **maintenance mode**, initially planned for 30 minutes to reduce load, but ultimately needed nearly **two hours** to restore stability.
* To prevent immediate re-saturation, we deferred processing some older jobs to lower-traffic periods.
* By **Thursday evening**, build times stabilized as overall traffic declined.
* By **Friday morning, February 14**, all remaining queued jobs had processed without further intervention.

#### **Next Steps**

1. **Finalizing Database Transitions**
   1. To fully resolve performance degradation, we transitioned **two additional tables** closely related to the affected table.
   2. This was completed during a **maintenance window on Saturday, February 15 (8 PM – 11:59 PM PST)**.
2. **Long-Term Database Optimizations**
   1. We will perform **VACUUM FULL** on legacy tables to remove ~36B dead tuples and optimize disk layout.
   2. Further maintenance windows will be scheduled on late-night weekends.
3. **Partitioning Strategy Enhancements**
   1. We are evaluating **size-based partitioning** or a refined **time-based strategy with shorter intervals** to prevent similar issues (a minimal sketch of the latter appears at the end of this postmortem).
4. **Improved Monitoring & Alerting**
   1. We will introduce **earlier warning thresholds** to detect job queue buildup before it becomes critical.
   2. We will enhance **database contention monitoring** to catch autovacuum failures and lock contention earlier.

#### **Conclusion**

Even after **12+ years in production**, incidents like this remind us of the importance of continually evolving our **data management and monitoring practices**. As Coveralls scales, we are committed to refining our approach to proactively address infrastructure challenges before they affect users.

We sincerely apologize to all users affected by this incident. If you need assistance with historical builds or workflow adjustments, or if you'd like to share feedback, please contact us at [**[email protected]**](mailto:[email protected]). Your input will help us shape future improvements.
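Sketch referenced in Next Steps item 3: what a shorter-interval, time-based partition layout could look like, assuming PostgreSQL declarative partitioning. The table and column names are hypothetical and do not reflect our actual schema.

```sql
-- Hypothetical example of time-based range partitioning with shorter
-- (weekly) intervals; names and columns are illustrative only.
CREATE TABLE background_jobs (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- Smaller, more frequent partitions keep any single partition (and the
-- autovacuum work it generates) at a manageable size.
CREATE TABLE background_jobs_2025_02_17
    PARTITION OF background_jobs
    FOR VALUES FROM ('2025-02-17') TO ('2025-02-24');

CREATE TABLE background_jobs_2025_02_24
    PARTITION OF background_jobs
    FOR VALUES FROM ('2025-02-24') TO ('2025-03-03');
```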