Skylight incident

Expired Certificates

Critical Resolved View vendor source →

Skylight experienced a critical incident on July 10, 2022 affecting Application and Application, lasting 40m. The incident has been resolved; the full update timeline is below.

Started
Jul 10, 2022, 04:26 PM UTC
Resolved
Jul 10, 2022, 05:06 PM UTC
Duration
40m
Detected by Pingoru
Jul 10, 2022, 04:26 PM UTC

Affected components

ApplicationApplication

Update timeline

  1. identified Jul 10, 2022, 04:26 PM UTC

    Our automation have failed to renew/replace the SSL certificates before they expired. We are currently replacing the certificates manually. In the meantime, both the Skylight dashboard and the data collecting servers (used by the agents to submit traces) are inaccessible. We are sorry for the inconvenience.

  2. monitoring Jul 10, 2022, 05:00 PM UTC

    We have deployed new certificates. All issues should be resolved now, although it may take a while for the agents to notice they are able to reattempt the authentication process. This process can be sped up (or in some case necessary if the agent had given up) by restarting the Rails app, which resets the agent. Please reach out to support if you need additional assistance. As part of the response to the Heroku security incident earlier this year (https://status.heroku.com/incidents/2413), a token that was used to upload certificates was invalidated without being replaced. Since the renewal script runs rarely the problem was not noticed until today. We had previously set up secondary monitoring to alert us when a certificate is nearing its expiration date (i.e. the automation did not work as expected), but it appears that secondary monitoring had also failed for a different reason. We are very sorry for the inconvenience.

  3. resolved Jul 10, 2022, 05:06 PM UTC

    We believe all the associated issues with the certificate expiration has been resolved. If you need further assistance, please reach out to support. Sorry again for the inconvenience.