Geoscape Australia incident

Predictive API service failure

Critical Resolved

Geoscape Australia experienced a critical incident on January 23, 2020, affecting the Predictive API and lasting 3d 22h. The incident has been resolved; the full update timeline is below.

Started
Jan 23, 2020, 11:04 PM UTC
Resolved
Jan 27, 2020, 09:11 PM UTC
Duration
3d 22h
Detected by Pingoru
Jan 23, 2020, 11:04 PM UTC

Affected components

Predictive API

Update timeline

  1. identified Jan 23, 2020, 11:04 PM UTC

    The issue has been identified and a fix is being implemented.

  2. monitoring Jan 23, 2020, 11:38 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Jan 27, 2020, 09:11 PM UTC

    This incident has been resolved.

  4. postmortem Feb 02, 2020, 11:33 PM UTC

    # **What happened?**

    The Production Predictive API service had a 54-minute outage caused by human error. The disruption to services was immediately identified and recovery actions were directly initiated.

    **9:21 am** (start of outage)

    * An engineer deleted a component of the Predictive API in the production environment.
    * The error was immediately identified by both the engineer and automated monitoring.
    * Recovery actions to restore the service were initiated at once.

    **10:15 am** (end of outage)

    * Manual testing and automated monitoring confirmed the return of all services.

    ## **What did we do?**

    Automated monitoring triggered an outage notification to all customers through [status.psma.com.au](http://status.psma.com.au). We quickly enacted a recovery plan to restore services. Once restored, we monitored manually for a period before going back to automated monitoring. This then allowed us to start our postmortem analysis to identify why this happened and how we can do better.

    ## **What did we learn?**

    Manual infrastructure changes are rare given our use of ‘infrastructure as code’. Still, when they are required, clear labelling of components becomes very important. What works for code may not be enough for humans. We were unhappy with the speed of automated deployment in the recovery process.

    ## **What are we going to do?**

    * Improve the labelling of cloud infrastructure and components to be more straightforward and explicit (not just good for automated deployment) to prevent confusion.
    * Improve recovery processes to reduce the time for service restoration.
    * Improve the accessibility and usefulness of system logs to facilitate more effective investigations.
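
The labelling action item in the postmortem could be backed by a small pre-deletion check. The sketch below is illustrative only and is not Geoscape's tooling: the `Component` model, the `environment` and `service` label keys, and the `check_manual_deletion` helper are all hypothetical names chosen for this example. It assumes that every component carries explicit labels and that an engineer must state the environment they intend to change and retype the component name before a manual deletion proceeds.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical model of a cloud component: a name plus explicit labels."""
    name: str
    labels: dict = field(default_factory=dict)


class DeletionRefused(Exception):
    """Raised when a manual deletion fails one of the label checks."""


# Assumed label keys; a real deployment would define its own.
REQUIRED_LABELS = ("environment", "service")


def check_manual_deletion(component, expected_environment, typed_confirmation):
    """Return True if a manual deletion may proceed, else raise DeletionRefused."""
    # Refuse to act on components that are not explicitly labelled.
    missing = [key for key in REQUIRED_LABELS if key not in component.labels]
    if missing:
        raise DeletionRefused(
            f"{component.name} is missing labels: {', '.join(missing)}"
        )

    # The engineer states which environment they believe they are changing;
    # a mismatch means they are about to delete the wrong thing.
    actual_environment = component.labels["environment"]
    if actual_environment != expected_environment:
        raise DeletionRefused(
            f"{component.name} is labelled '{actual_environment}', "
            f"not '{expected_environment}'"
        )

    # Require the exact component name to be retyped as a final confirmation.
    if typed_confirmation != component.name:
        raise DeletionRefused("typed confirmation does not match component name")

    return True
```

For example, calling `check_manual_deletion` with a component labelled `production` while passing `"staging"` as the expected environment raises `DeletionRefused` instead of letting the deletion continue. The check is deliberately strict: it favours a refused deletion over an ambiguous one, which matches the postmortem's point that labels need to be clear to humans and not only to automated deployment.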