Coveralls incident
Service Disruption due to "Invalid SSL Certificate"
Coveralls experienced a critical incident on January 22, 2025, affecting Coveralls.io Web and Coveralls.io API and lasting 29 minutes. The incident has been resolved; the full update timeline is below.
Affected components
- Coveralls.io Web
- Coveralls.io API
Update timeline
- investigating Jan 22, 2025, 03:08 PM UTC
We are currently investigating this issue.
- identified Jan 22, 2025, 03:27 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Jan 22, 2025, 03:33 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Jan 22, 2025, 03:37 PM UTC
This incident has been resolved.
- postmortem Jan 22, 2025, 06:09 PM UTC
**Incident**: Service Disruption Due to Failed SSL Certificate Renewal

* **Date**: January 22, 2025
* **Duration**: 17 minutes (07:00-07:17 UTC)
* **Impact**: Service interruption due to SSL certificate issue

**Summary**: Coveralls experienced a brief service disruption when our automated SSL certificate renewal process failed. While our SSL certificates auto-renew 30 days before expiration, one unreachable server prevented the renewal process from completing successfully.

**Timeline**:

* Prior to incident: Multiple automated renewal attempts unsuccessful
* 07:00 UTC: Service disruption began
* 07:17 UTC: Service restored after infrastructure adjustment

**Root Cause**: The incident occurred when one server became unreachable during our SSL certificate auto-renewal process. While our certificates are configured to auto-renew, the renewal process requires successful deployment across our infrastructure. The unreachable server prevented this deployment, ultimately leading to an outage due to “certificate expiration.”

**Resolution**: We identified and removed the problematic server from our infrastructure, allowing the SSL certificate renewal and deployment to complete successfully.

**Preventive Measures**:

1. Enhanced monitoring for SSL renewal processes
2. Improved early warning system for similar infrastructure issues
3. Updated incident response procedures (new SOP)
4. Additional automated health checks

We apologize for any disruption this caused and continue working to improve our infrastructure reliability.
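For readers who want to watch for the same failure mode on their own services, the kind of renewal monitoring listed under Preventive Measures could be sketched roughly as follows. This is an illustrative example only, not Coveralls' actual tooling: the function names and the `grace_days` threshold are hypothetical, and the only figure taken from the postmortem is the 30-day renewal window. The idea is that a certificate still in service well inside its renewal window suggests that renewal attempts are failing.

```python
import socket
import ssl
from datetime import datetime, timezone

# From the postmortem: certificates auto-renew 30 days before expiration.
RENEWAL_WINDOW_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string,
    e.g. 'Feb 21 00:00:00 2025 GMT'."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def renewal_overdue(
    days_left: float,
    window_days: int = RENEWAL_WINDOW_DAYS,
    grace_days: int = 7,  # hypothetical slack before alerting
) -> bool:
    """True when the served certificate is deep enough inside the renewal
    window that auto-renewal has probably been failing."""
    return days_left < window_days - grace_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served by host:port.

    With the default context, an already expired certificate makes the
    handshake raise ssl.SSLCertVerificationError, which is itself a signal
    worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `fetch_not_after` for each public hostname, pass the result to `days_until_expiry` with `datetime.now(timezone.utc)`, and raise an alert whenever `renewal_overdue` returns true, which would give days of warning rather than minutes.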