Skylight incident

Heroku SSL Service Degradation

Major Resolved

Skylight experienced a major incident on October 25, 2021 affecting Application and Hosting, lasting 1h 33m. The incident has been resolved; the full update timeline is below.

Started
Oct 25, 2021, 08:11 AM UTC
Resolved
Oct 25, 2021, 09:45 AM UTC
Duration
1h 33m
Detected by Pingoru
Oct 25, 2021, 08:11 AM UTC

Affected components

Application, Hosting

Update timeline

  1. investigating Oct 25, 2021, 08:11 AM UTC

    The Skylight dashboard is currently inaccessible due to a potential configuration issue. This outage also affects agent authentication: new authentications from agents will not succeed at the moment. Agents that are already authenticated can continue to report data until their authentication session expires.

  2. identified Oct 25, 2021, 08:41 AM UTC

    The dashboard and agent authentication endpoints are affected by a service degradation at our hosting provider, Heroku. This outage impacts their "SSL Endpoint" add-on and is expected to last for 8 hours. We have begun the process of migrating away from the add-on, but this normally takes up to 24 hours. We are investigating whether there is any way to speed up the process or re-route the affected endpoints. The data processing pipeline is technically unaffected by this outage, as it is hosted on a different provider. However, given that agents are failing to authenticate (and therefore failing to submit traces), we expect this to cause gaps in Skylight data during the outage period. We are very sorry for the inconvenience.

  3. monitoring Oct 25, 2021, 09:35 AM UTC

    We have completed the migration and are monitoring the situation. Because the fix required updating our DNS records, it may take some time to fully resolve. The Skylight dashboard (skylight.io) should be accessible as soon as your operating system has refreshed the DNS record, which should happen within minutes as we have a TTL of 300 seconds. If you are still unable to access the site, please email [email protected] for assistance. Your Skylight agents should resume reporting data once they retry the previously failed authentication request. If this does not occur, you can try restarting your app, which forces the agent to restart and authenticate again. If that still doesn't work, please email [email protected] for further help. Once again, we are very sorry for the trouble.

  4. resolved Oct 25, 2021, 09:45 AM UTC

    Our metrics indicate that the agent report rate has recovered to its pre-incident level. We believe most customer agents have resumed normal reporting and the issue has been resolved. If you continue to encounter issues, please email [email protected] for assistance. Unfortunately, if your agent was "locked out" by an expired authentication session and was unable to report data during the outage, that unreported data will not be available to view on the dashboard. We are truly sorry about this.
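The monitoring update notes that the dashboard becomes reachable once the local resolver refreshes a DNS record with a 300-second TTL. A minimal sketch of how a client could wait for that refresh is shown below. It is not part of Skylight's tooling: the function names, the polling interval, and the idea of comparing against a known set of pre-migration addresses are all assumptions made for illustration.

```python
import socket
import time
from typing import Callable, Iterable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses reported by the OS resolver."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def wait_for_dns_change(
    hostname: str,
    old_addresses: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_ipv4,
    ttl_seconds: int = 300,   # TTL quoted in the incident update
    poll_seconds: int = 30,   # hypothetical polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the hostname resolves to something other than old_addresses.

    Gives up after roughly one TTL and returns False if the answer never changes.
    """
    stale = set(old_addresses)
    attempts = ttl_seconds // poll_seconds + 1
    for attempt in range(attempts):
        try:
            current = set(resolver(hostname))
        except socket.gaierror:
            current = set()  # treat a failed lookup as "not refreshed yet"
        if current and current != stale:
            return True
        if attempt < attempts - 1:
            sleep(poll_seconds)
    return False
```

Calling `wait_for_dns_change("skylight.io", old_addresses)` with the addresses seen before the migration would block for at most about one TTL. A `False` result suggests that a cache somewhere along the path is holding the record longer than its TTL, at which point restarting the app or contacting support, as described in the update, is the next step.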