Geoscape Australia incident

PSMA Cloud and Addresses API outage

Critical Resolved

Geoscape Australia experienced a critical incident on January 11, 2021 affecting Addresses API, lasting 19h 49m. The incident has been resolved; the full update timeline is below.

Started
Jan 11, 2021, 01:05 AM UTC
Resolved
Jan 11, 2021, 08:55 PM UTC
Duration
19h 49m
Detected by Pingoru
Jan 11, 2021, 01:05 AM UTC

Affected components

Addresses API

Update timeline

  1. investigating Jan 11, 2021, 01:05 AM UTC

    We are currently experiencing an outage to a number of services. The team are investigating and will provide an update as soon as possible.

    Best Regards,
    Geoscape Support Team, Geoscape Australia
    T: +61 (0)2 6260 9000
    https://support.geoscape.com.au

  2. identified Jan 11, 2021, 02:38 AM UTC

    The issue is related to an internal error within our hosting provider's environment. We are currently working with them to resolve this as soon as possible.

  3. monitoring Jan 11, 2021, 07:48 AM UTC

    The services have been restored and we will continue to monitor the results.

  4. resolved Jan 11, 2021, 08:55 PM UTC

    This incident has been resolved.

  5. postmortem Jan 18, 2021, 10:55 PM UTC

    # What happened?

    * All PSMA Cloud functions and the Addresses API address search calls experienced an outage from 11:44am to 6:43pm AEDT.
    * The Online Data Delivery Service experienced an outage from 11:44am to 12:48pm AEDT.
    * The outages were caused by the failure of a switch in our hosting provider's network. There was a significant delay to the return of PSMA Cloud services as a critical database server could not be restored by the hosting provider using the normal processes.

    **11:44am**

    * Start of outage

    **11:48am**

    * Geoscape's monitoring notifications indicate that a number of servers and their associated services are experiencing an outage.
    * Investigation reveals that the affected servers are all related to a third-party hosting provider.
    * The hosting provider is contacted, confirms that the outage is within their environment, and raises a P1 incident.

    **12:05pm**

    * A Statuspage incident is raised to advise customers of the outage.

    **12:15pm**

    * The hosting provider identifies that the affected servers have disconnected from the storage channel paths, preventing them from seeing any datastores. Remediation action commences.

    **12:50pm**

    * Some servers are returned to operation. The Online Data Delivery Service becomes available again.
    * The hosting provider notes that a database server is failing to start up. A backup restore is commenced for this server.

    **2:20pm**

    * Additional servers are brought online.
    * The hosting provider identifies the root cause to be a failure of a storage switch module.

    **3:00pm**

    * Restoring the remaining database server continues to be unsuccessful due to corruption issues. Another attempt is made to recover the server.

    **3:45pm**

    * The corruption errors continue for the database server, so a decision is made to deploy a Geoscape database backup onto a previously de-commissioned server. Work commences to re-commission the server and restore the database.

    **5:40pm**

    * Testing of this server and database is successful. The hosting provider is requested to update the server configuration, firewall and network rules so that it replaces the existing failed database server.

    **6:43pm**

    * As soon as the configurations are complete, services are restored (End of outage)

    # What did we learn?

    * The hosting provider had not tested server backups.
    * The extra activity that Geoscape conducts to back up databases 'just in case' was validated.
    * Moving our infrastructure to a new cloud-based host is a good thing.

    # What are we going to do?

    * The hosting provider has been requested to confirm that all server backups are tested on a monthly basis.
    * Continue efforts to move all PSMA Cloud infrastructure to the new cloud-based host.
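The postmortem gives its times in AEDT, while the status timeline above is in UTC. The sketch below shows one way to line the two up. It assumes AEDT is a fixed UTC+11 offset and that all postmortem times fall on January 11, 2021; the helper name `aedt_to_utc` is hypothetical and not part of the vendor's report.

```python
from datetime import datetime, timedelta, timezone

# Assumption: AEDT treated as a fixed UTC+11 offset for this single day.
AEDT = timezone(timedelta(hours=11), "AEDT")


def aedt_to_utc(clock: str, day: str = "2021-01-11") -> datetime:
    """Convert a postmortem clock time such as '11:44am' (AEDT) to UTC."""
    local = datetime.strptime(f"{day} {clock}", "%Y-%m-%d %I:%M%p")
    return local.replace(tzinfo=AEDT).astimezone(timezone.utc)


outage_start = aedt_to_utc("11:44am")   # start of outage per the postmortem
outage_end = aedt_to_utc("6:43pm")      # services restored per the postmortem
status_raised = aedt_to_utc("12:05pm")  # Statuspage incident raised

print(outage_start.isoformat())   # 2021-01-11T00:44:00+00:00
print(status_raised.isoformat())  # 2021-01-11T01:05:00+00:00
print(outage_end - outage_start)  # 6:59:00
```

Under these assumptions, the 12:05pm AEDT Statuspage entry matches the 01:05 AM UTC "investigating" update, and the outage described in the postmortem lasted 6h 59m, shorter than the 19h 49m between the first status update and the "resolved" update.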