Coveralls incident

500 Errors

Critical · Resolved

Coveralls experienced a critical incident on August 22, 2024 affecting Coveralls.io Web and Coveralls.io API, lasting 2h 17m. The incident has been resolved; the full update timeline is below.

Started
Aug 22, 2024, 07:56 PM UTC
Resolved
Aug 22, 2024, 10:13 PM UTC
Duration
2h 17m
Detected by Pingoru
Aug 22, 2024, 07:56 PM UTC

Affected components

Coveralls.io Web, Coveralls.io API

Update timeline

  1. investigating Aug 22, 2024, 07:56 PM UTC

    We have received reports of 500 errors returned by the Coveralls API on coverage report uploads. We are investigating.

  2. identified Aug 22, 2024, 08:08 PM UTC

    The issue has been identified and a fix is being implemented.

  3. monitoring Aug 22, 2024, 09:11 PM UTC

    A fix has been implemented and we are monitoring the results.

  4. monitoring Aug 22, 2024, 09:11 PM UTC

    We are continuing to monitor for any further issues.

  5. monitoring Aug 22, 2024, 09:13 PM UTC

    We will come out of maintenance mode as soon as we confirm the fix.

  6. monitoring Aug 22, 2024, 09:45 PM UTC

    We are resolving a remaining issue.

  7. resolved Aug 22, 2024, 10:13 PM UTC

    This incident has been resolved.

  8. postmortem Aug 23, 2024, 12:14 AM UTC

    **Reason for the incident**: We failed to upgrade to the new RDS CA and update the complementary SSL certs on all clients before the previous CA's expiration date. We misunderstood the potential impact of missing that deadline and treated the change as housekeeping at normal-to-low priority. As a result, we failed to prioritize the ticket and make the necessary changes in time to avoid this incident.

    **Reason for the response time**: While implementing the fix, we confronted very verbose documentation that made it hard to identify the correct procedure for our context, especially under pressure. Once we did identify it and implemented a fix, we could not get our database clients to establish a connection in production with a freshly downloaded cert, even though the same cert worked in tests from local machines. In the end, we manually copied the contents of the cert into an existing file before our app recognized it. We still don't know why, but the confusion surrounding this added at least an extra hour to our response time as we cycled through other applicable certs and recovered from failed deployments.

    **How to avoid the incident in the future**: We will treat all notices from infrastructure providers as requiring review by multiple stakeholders at different levels, and will apply our already established procedure for handling priority infrastructure upgrades in a timely manner, as scheduled events with review and sign-off.
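
    The manual workaround described above (copying the new cert's contents into an existing file that the database clients already reference) can be sketched as follows. This is a minimal, hypothetical illustration, not Coveralls' actual code; the function name and file paths are assumptions.

    ```python
    # Hypothetical sketch of the manual workaround: append the new RDS CA
    # certificate to an existing trusted CA bundle file, so clients that
    # already point at that bundle pick up the new CA without a config change.

    def append_ca_to_bundle(new_ca_path: str, bundle_path: str) -> bool:
        """Append a PEM certificate to an existing CA bundle.

        Returns True if the cert was appended, False if the bundle
        already contained it (so the operation is idempotent and safe
        to rerun during an incident).
        """
        with open(new_ca_path) as f:
            new_ca = f.read().strip() + "\n"

        try:
            with open(bundle_path) as f:
                bundle = f.read()
        except FileNotFoundError:
            bundle = ""

        if new_ca.strip() in bundle:
            return False  # already present; nothing to do

        with open(bundle_path, "a") as f:
            # Keep PEM blocks separated by a newline boundary.
            if bundle and not bundle.endswith("\n"):
                f.write("\n")
            f.write(new_ca)
        return True
    ```

    A client configured with `sslrootcert` (or the equivalent CA-file option for its driver) pointed at the bundle file would then trust both the old and the new CA during the transition.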