CloudBees incident

CloudBees Rollout service incident

CloudBees experienced a notice-level incident on July 15, 2020, lasting 12h 28m. The incident has been resolved; the full update timeline is below.

Started: Jul 15, 2020, 01:11 PM UTC
Resolved: Jul 16, 2020, 01:39 AM UTC
Duration: 12h 28m
Detected by Pingoru: Jul 15, 2020, 01:11 PM UTC

Update timeline

  1. identified Jul 15, 2020, 11:41 AM UTC

    There is an outage with a third-party service. We're contacting them to find out what the situation is.

  2. identified Jul 15, 2020, 12:38 PM UTC

    We have confirmed the database issue with our service provider and they are working to restore service. Service impact is that experiments can’t be updated, but existing flags are unaffected.

  3. identified Jul 15, 2020, 01:11 PM UTC

    Our service provider has updated us about the situation and the rough estimate for recovery is 2 to 3 hours.

  4. identified Jul 15, 2020, 02:08 PM UTC

    Our service provider announced that it is going to take longer to recover the system. The current rough estimate for recovery is 5 to 9 hours.

  5. identified Jul 15, 2020, 04:02 PM UTC

    We are monitoring https://status.compose.com/ for further updates.

  6. identified Jul 15, 2020, 09:07 PM UTC

    We've received this message from our service provider: “At this point we are cautiously optimistic. Our engineers are close to having virtual networking up across all hosts in the cluster. So far so good. Once stable we will start bringing capsules back up.” More updates to follow.

  7. identified Jul 15, 2020, 10:42 PM UTC

    IBM Compose update: "Virtual networking is up across all hosts in the cluster and the situation appears to be stable. We are slowly starting data/member capsules. Once those are up, we will start portals which will restore customer access." In parallel, CloudBees engineering teams are now working to restore the database service to our own infrastructure, with a view to failing over if Compose is not able to restore access in a timely manner.

  8. identified Jul 16, 2020, 12:05 AM UTC

    Our engineering team has successfully restored the database, but it is still not 100% operational. More updates to follow.

  9. monitoring Jul 16, 2020, 12:40 AM UTC

    Apart from Impression Analytics, which is currently not working, the service is operational again. We're monitoring the situation.

  10. monitoring Jul 16, 2020, 01:18 AM UTC

    The Rollout core service (API/login/web) outage has been resolved and these services are now fully operational. However, Impression Analytics is not currently available; the engineering team is working to resolve this issue. We will continue to provide service updates on the status of Impression Analytics until the issue is resolved.

  11. resolved Jul 16, 2020, 01:39 AM UTC

    All CloudBees Rollout (Feature Flags) services are now fully operational. We have not identified any data loss or security impact from this outage. An outage post-mortem will be carried out and corrective actions taken in due course. Thank you for your patience.