Coveralls incident

500 Errors

Critical · Resolved

Coveralls experienced a critical incident on August 22, 2024 affecting Coveralls.io Web and Coveralls.io API, lasting 2h 17m. The incident has been resolved; the full update timeline is below.

Started
Aug 22, 2024, 07:56 PM UTC
Resolved
Aug 22, 2024, 10:13 PM UTC
Duration
2h 17m
Detected by Pingoru
Aug 22, 2024, 07:56 PM UTC

Affected components

Coveralls.io Web, Coveralls.io API

Update timeline

  1. investigating Aug 22, 2024, 07:56 PM UTC

    We have received reports of 500 errors returned by the Coveralls API on coverage report uploads. We are investigating.

  2. identified Aug 22, 2024, 08:08 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Aug 22, 2024, 09:11 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. monitoring Aug 22, 2024, 09:11 PM UTC

    We are continuing to monitor for any further issues.

  5. monitoring Aug 22, 2024, 09:13 PM UTC

    We will come out of maintenance mode as soon as we confirm the fix.

  6. monitoring Aug 22, 2024, 09:45 PM UTC

    We are resolving a remaining issue.

  7. resolved Aug 22, 2024, 10:13 PM UTC

    This incident has been resolved.

  8. postmortem Aug 23, 2024, 12:14 AM UTC

    **Reason for the incident**: We failed to upgrade to the new RDS CA and update the complementary SSL certs on all clients before the previous CA's expiration date. We misunderstood the potential impact of missing that deadline and treated the change as housekeeping at normal-to-low priority. As a result, we failed to prioritize the ticket and make the necessary changes in time to avoid this incident.

    **Reason for the response time**: While implementing the fix, we confronted very verbose documentation that made it hard to identify the correct procedure for our context, especially under pressure. Once we did identify it and implemented a fix, we could not get our database clients to establish a connection in production with a freshly downloaded cert, even though the same cert worked in tests from local machines. In the end, we manually copied the contents of the cert into an existing file before our app recognized it. We still don't know why, but the confusion surrounding this added at least an extra hour to our response time as we cycled through other applicable certs and recovered from failed deployments.

    **How to avoid the incident in the future**: We will treat all notices from infrastructure providers as requiring review by multiple stakeholders at different levels, and will apply our already established procedure for handling priority infrastructure upgrades in a timely manner, as scheduled events with review and sign-off.
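
    The manual workaround described above (copying the new cert's contents into an existing file that the database clients already reference) can be sketched as follows. This is a minimal, hypothetical illustration, not Coveralls' actual code; the function name and file paths are assumptions.

    ```python
    # Hypothetical sketch of the manual workaround: append the new RDS CA
    # certificate to an existing trusted CA bundle file, so clients that
    # already point at that bundle pick up the new CA without a config change.

    def append_ca_to_bundle(new_ca_path: str, bundle_path: str) -> bool:
        """Append a PEM certificate to an existing CA bundle.

        Returns True if the cert was appended, False if the bundle
        already contained it (so the operation is idempotent and safe
        to rerun during an incident).
        """
        with open(new_ca_path) as f:
            new_ca = f.read().strip() + "\n"

        try:
            with open(bundle_path) as f:
                bundle = f.read()
        except FileNotFoundError:
            bundle = ""

        if new_ca.strip() in bundle:
            return False  # already present; nothing to do

        with open(bundle_path, "a") as f:
            # Keep PEM blocks separated by a newline boundary.
            if bundle and not bundle.endswith("\n"):
                f.write("\n")
            f.write(new_ca)
        return True
    ```

    A client configured with `sslrootcert` (or the equivalent CA-file option for its driver) pointed at the bundle file would then trust both the old and the new CA during the transition.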