Coveralls incident

Elevated 504 Timeout Errors


Coveralls experienced a notice-level incident on September 3, 2025 affecting Coveralls.io Web and Coveralls.io API, lasting 21d 6h. The incident has been resolved; the full update timeline is below.

Started
Sep 03, 2025, 06:00 PM UTC
Resolved
Sep 25, 2025, 12:53 AM UTC
Duration
21d 6h
Detected by Pingoru
Sep 03, 2025, 06:00 PM UTC

Affected components

Coveralls.io Web, Coveralls.io API

Update timeline

  1. monitoring Sep 03, 2025, 06:00 PM UTC

    We’re currently seeing elevated reports of 504 Timeout errors affecting some customers on a subset of Coveralls pages, including:
    - Source File pages
    - Repo pages
    - Add Repos pages

    All systems and pages are generally operational; a subset of customers are experiencing these errors, sometimes intermittently. There is a public tracking issue for the Source File timeout errors here: https://github.com/lemurheavy/coveralls-public/issues/1757

    Fix in progress: We’re implementing a short-term fix over the next 24–48 hours, which should eliminate the timeouts. A longer-term fix is also planned and will roll out over several weeks, but its early phases should also reduce the request times that were originally triggering the 504 timeouts.

    What you can do: If you're currently affected, we recommend following updates here and subscribing to the public issue: https://github.com/lemurheavy/coveralls-public/issues/1757 If your issue pattern differs from the above, or you suspect a different root cause, reach out to [email protected] and we'll verify for you.

  2. monitoring Sep 05, 2025, 04:05 PM UTC

    We are still working on a near-term fix. We will post updates on this status page, and on the public issue, when complete: https://github.com/lemurheavy/coveralls-public/issues/1757

  3. monitoring Sep 08, 2025, 06:57 PM UTC

    All systems operational. Continuing to keep this open until we have released our short-term fix into production. Subscribe for updates at this status page, or follow this public tracking issue for updates: https://github.com/lemurheavy/coveralls-public/issues/1757

  4. monitoring Sep 09, 2025, 03:24 PM UTC

    All systems operational. We have released the first of two parts of a near-term solution into production, resolving a small subset of the 504 errors. We are still working on releasing part two into production. Subscribe for updates at this status page, or follow this public tracking issue: https://github.com/lemurheavy/coveralls-public/issues/1757

  5. monitoring Sep 12, 2025, 03:44 PM UTC

    All systems operational. Earlier today (6:45–7:45 AM PDT), we received elevated reports of 504 timeout errors. We have not been able to reproduce the issue since, but if you are still experiencing errors, please contact us at [email protected]. The affected areas may include:
    - Coverage Report Uploads (/api/v1/jobs)
    - Add Repos Page
    - Repo Page
    - Source File Page

    Fixes for the Add Repos, Repo, and Source File pages are scheduled to be deployed by end of day (PDT).

  6. monitoring Sep 15, 2025, 06:57 PM UTC

    Mitigation in place. All systems operational. This morning we deployed additional capacity and autoscaling measures to reduce 504 errors on coverage report uploads:
    - Doubled our web server fleet (on top of the prior doubling when this issue began).
    - Enabled autoscaling at the web layer, allowing the fleet to double again automatically when NGINX response times exceed thresholds (a sketch of this kind of rule appears at the end of this update).

    The underlying trigger remains rare surges of upload requests from outlier repositories (750–1250 uploads per build). While we have paused processing for these repos, our HTTP servers must still handle the incoming requests until they stop.

    Timezone coverage: As a small team based in Los Angeles (PDT), our ability to respond in real time is most limited overnight (10p–6a PDT). Unfortunately, the primary outlier repos are in APAC, making this the window of highest risk. With these changes, we hope to reduce the occurrence of upload 504s during this window. We will monitor results closely and continue tuning autoscaling thresholds. Please let us know if you continue to see 504 errors on uploads.
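For illustration only, here is a minimal sketch of the kind of response-time-threshold scaling rule described above. It is written in Python with hypothetical names, thresholds, and fleet sizes; it is not Coveralls' actual implementation, which runs against NGINX metrics and their provisioning layer.

```python
# Illustrative sketch only: a scaling rule that doubles the web fleet when the
# observed p95 response time exceeds a threshold, and shrinks it back when load
# subsides. All numbers and names here are hypothetical.

P95_THRESHOLD_S = 2.0   # hypothetical p95 response-time threshold (seconds)
BASE_FLEET = 8          # hypothetical baseline web server count
MAX_FLEET = 16          # allow the fleet to double once more under load


def desired_fleet_size(p95_response_time_s: float, current: int) -> int:
    """Return the target web-server count for the observed p95 response time."""
    if p95_response_time_s > P95_THRESHOLD_S:
        return min(current * 2, MAX_FLEET)      # scale up: double, capped
    if p95_response_time_s < P95_THRESHOLD_S / 2:
        return max(current // 2, BASE_FLEET)    # scale back down when well under threshold
    return current                              # otherwise hold steady


if __name__ == "__main__":
    # Simulated surge: an outlier repo pushing hundreds of uploads per build
    # drives p95 response times up, then traffic subsides.
    fleet = BASE_FLEET
    for p95 in [0.4, 1.1, 3.2, 4.0, 1.8, 0.6]:
        fleet = desired_fleet_size(p95, fleet)
        print(f"p95={p95:.1f}s -> fleet size {fleet}")
```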

  7. monitoring Sep 22, 2025, 04:43 PM UTC

    Mitigated – Monitoring

    All systems operational. Recent mitigations, including fleet expansion and autoscaling, have reduced 504 timeout reports significantly. The remaining reports are infrequent and occur mostly during overnight and weekend hours (PDT). We are continuing to monitor closely and are working on a multi-part solution to eliminate all known causes. Until then, we are keeping this incident open in Monitoring. We will close it once 504 errors have returned to being unexpected, isolated events.

  8. monitoring Sep 23, 2025, 07:14 PM UTC

    Fix for unrelated 500 errors: If you receive a `500` error with this error message format:

    > ⚠️ Internal server error. Please contact Coveralls team.

    please know it is unrelated to the `504` errors being monitored in this open incident. Those intermittent `500` errors are caused by a regression in one of the latest coverage-reporter releases: `v0.6.16` or `v0.6.17`.

    Workaround: Pin your coverage-reporter-version to `v0.6.15` in your integration config. For thorough instructions, see this public issue: https://github.com/coverallsapp/coverage-reporter/issues/180

    We’re investigating the root cause and will post updates once a fix is released.

  9. resolved Sep 25, 2025, 12:53 AM UTC

    500 Internal Server Errors on Uploads

    The recent 500 error surfacing during some coverage uploads as:

    > ⚠️ Internal server error. Please contact Coveralls team.

    has been resolved. A full postmortem will be published here soon. In the meantime, you can find more detail in the main tracking issue: https://github.com/coverallsapp/coverage-reporter/issues/180

    Summary: The root cause was ultimately infrastructure-related, not a regression in recent coverage-reporter releases. The previous workaround of pinning your coverage-reporter version is therefore not required.

    We have decided to close this incident, which we intentionally kept open for over a week to track a series of 504 and 5xx issues with overlapping root causes. In hindsight, the broadened scope made updates less clear than we’d hoped.

    With today’s resolution and the mitigations applied throughout the week, the occurrence of 504 errors during uploads (POSTs) has been significantly reduced. Going forward, any new 504 errors should be considered unexpected, isolated events.

    At the same time, we continue work on several instances of intermittent GET-related 504 errors affecting:
    - Source File pages
    - Repo pages
    - Add Repos pages

    Progress on those issues will be reported separately here: https://github.com/lemurheavy/coveralls-public/issues/1757

  10. postmortem Sep 25, 2025, 04:35 PM UTC

    **This is a postmortem on this specific issue:** Intermittent 500 Errors on Coverage Uploads

    **Summary**
    Between September 20–24, some customers experienced intermittent `500 Internal Server Error` responses during coverage uploads (`POST /api/v1/jobs`). The issue was initially hard to diagnose because:
    * Failures did not surface reliably in our error tracker (BugSnag).
    * They appeared to affect only some requests and some customers.

    **Impact**
    * Some coverage uploads failed to process, causing build reporting delays or gaps.
    * Frequency was low enough to appear intermittent, which delayed detection and resolution.

    **Timeline**
    * **Sep 20–23**: First customer reports of intermittent 500s. Initial theories involved a regression in a recent release of our coverage-reporter integration (client-side).
    * **Sep 23–24**: Deep log analysis across ELB and application logs revealed errors concentrated on a single web server.
    * **Sep 24**: Confirmed that server alone was responsible for thousands of SSL-related failures (`Faraday::SSLError`, `OpenSSL::SSL::SSLError`, `Seahorse::Client::NetworkingError`). Other servers were clean.
    * **Sep 24**: Mitigation: that server was destroyed. Errors ceased immediately.

    **Root Cause**
    We believe this additional web server was provisioned during autoscaling with a different Ubuntu version than the rest of the fleet. This appears to have resulted in a broken or outdated CA certificate store, causing outbound SSL connections (GitHub, Travis, etc.) to fail intermittently and then bubble up as a `500` error for the original request (`POST /api/v1/jobs`).

    **Resolution**
    * Problematic server removed from service.
    * Future mitigation: verify baseline OS/version and CA store when adding new servers, especially via automation (a sketch of such a check follows this postmortem).
    * Next step: document the correct procedure to disable a single server in Cloud66 load balancers, instead of outright destroying it, so we can retain the server for forensic investigation.

    **Lessons Learned**
    * Errors can hide if they don’t surface in the bug tracker. Direct log analysis is essential.
    * Even one misconfigured server can cause significant customer impact.
    * Consistency in OS/base image, and CA store, is critical.
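For context on the CA-store mitigation above, here is a minimal, hypothetical sketch of a provisioning-time check that a new server's system CA certificate store can verify the outbound HTTPS endpoints the application depends on. It is written in Python for illustration (the Coveralls stack is Ruby), it is not Coveralls' actual automation, and the endpoint list is an assumption.

```python
# Illustrative sketch only: fail a provisioning run if the system CA store
# cannot verify TLS connections to required outbound endpoints. Endpoints
# listed here are examples, not Coveralls' actual dependency list.

import socket
import ssl
import sys

REQUIRED_ENDPOINTS = ["api.github.com", "s3.amazonaws.com"]  # example hosts


def ca_store_can_verify(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake to host succeeds using the system CA store."""
    context = ssl.create_default_context()  # uses the OS CA bundle
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except (ssl.SSLError, OSError) as exc:
        print(f"TLS verification failed for {host}: {exc}", file=sys.stderr)
        return False


if __name__ == "__main__":
    failures = [h for h in REQUIRED_ENDPOINTS if not ca_store_can_verify(h)]
    if failures:
        # A broken or outdated CA store on a single server was the root cause
        # above; failing here keeps such a server out of the fleet.
        sys.exit(f"CA store check failed for: {', '.join(failures)}")
    print("CA store check passed for all required endpoints.")
```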