UpGuard incident

Database Outage

Critical, Resolved

UpGuard experienced a critical incident on June 6, 2022 affecting the Web App, External API, and Authentication components, lasting 6h 48m. The incident has been resolved; the full update timeline is below.

Started
Jun 06, 2022, 03:35 AM UTC
Resolved
Jun 06, 2022, 10:23 AM UTC
Duration
6h 48m
Detected by Pingoru
Jun 06, 2022, 03:35 AM UTC

Affected components

Web App, External API, Authentication

Update timeline

  1. investigating Jun 06, 2022, 03:44 AM UTC

    We are currently investigating this issue.

  2. investigating Jun 06, 2022, 03:49 AM UTC

    We are continuing to investigate this issue.

  3. identified Jun 06, 2022, 03:59 AM UTC

    A database backup has been restored and we will begin bringing systems back online.

  4. monitoring Jun 06, 2022, 04:07 AM UTC

    Systems are back online and we are currently reviewing logs to determine which users have been affected.

  5. resolved Jun 06, 2022, 10:23 AM UTC

    Systems are stable, and affected users will be contacted soon.

  6. postmortem Jun 15, 2022, 05:51 AM UTC

    PIR Date: 17th June, 2022
    Incident Date: June 6th, 2022
    Incident Time: 3:33 UTC
    Incident Number: INCI-159
    Severity Level: 1 - Blocker
    Affected Services: UpGuard CyberRisk, Web App, External API, Authentication services
    Outage Duration: 30 Minutes

    # Incident Summary

    On Monday, June 6th at 3:33 UTC, the UpGuard CyberRisk, Web App, External API & Authentication services experienced an outage of 30 minutes. Recovery from this outage led to the loss of 18 hours of data, affecting <1% of our customers.

    # Fault

    A database maintenance task commenced on UpGuard CyberRisk. The production database was incorrectly overwritten, halting access to UpGuard CyberRisk, Web App, External API & Authentication services.

    # Detection

    Internal alerting systems notified internal channels immediately of the service disruption across UpGuard CyberRisk, Web App, External API & Authentication services.

    # Impact

    1. Outage: UpGuard CyberRisk, Web App, External API & Authentication services were unavailable for 30 minutes. All performance transactions within the product were halted as a result.
    2. Data loss: The database backup restored was from the previous day, Sunday, June 5th at 7:00 UTC. Data entered into UpGuard CyberRisk during the previous 18 hours was lost, affecting <1% of our customers.

    # Recovery

    1. Due to the low number of transactions, UpGuard CyberRisk, Web App, External API & Authentication services were restored and brought back online with the last available backup from Sunday, June 5th at 7:00 UTC.
    2. Analysis was conducted to review changes that occurred between Sunday, June 5th at 7:00 UTC and Monday, June 6th at 3:33 UTC on UpGuard CyberRisk.

    # Timeline

    3:30 UTC: Database maintenance commenced.

    3:33 UTC: It was identified that, due to human error, the database maintenance had been carried out on the production database instead of the test database, which halted access to UpGuard CyberRisk, Web App, External API & Authentication services.

    3:45 UTC: An incident response group was formed.

    3:55 UTC: A decision was made to restore from the last available full backup, given the low impact of the transactions that had been executed.

    4:03 UTC: UpGuard CyberRisk, Web App, External API & Authentication services were successfully restored from backup data as of Sunday, June 5th at 7:00 UTC, within our Hosted Services Agreement. Data entered into UpGuard CyberRisk during the previous 18 hours was lost, affecting <1% of our customers.

    # Root Cause

    It was concluded that the root cause was human error, along with insufficient testing and verification of the maintenance work. In addition, the change type (restoring a database image into a non-production copy) represents a unique case for our change control procedures and was not classified as a production change. Although the change was performed from the production environment, it was classified as non-production because the destination was a non-production copy of the database rather than the source target.

    # Corrective Actions

    As a result of this incident, we have analyzed all of the transactions within UpGuard CyberRisk between Sunday, June 5th at 7:00 UTC and Monday, June 6th at 3:33 UTC to notify the affected customers with a description of the data loss.

    **Effective Immediately:**

    For this category of change, we will ensure it is aligned with all other types of change that potentially impact our customer and production data. This means that it will follow the formal change control process that requires review, testing, and approval.

    We will increase the frequency of our backups.

    We will require a backup before any major change to the production environment.

    **Targeting completion within 1 month:**

    We are reviewing all other types of changes that fall outside our regular change control process to verify coverage at the appropriate level of control.

    For any change that requires any manual process, we will ensure that:

    1. A scripted solution is present that allows for review and testing. This will include a scripted and tested database restore function.
    2. For changes that cannot be scripted, a documented playbook is available that allows for peer review and testing.

    We will be reviewing our external communications plan for our customers to ensure that the relevant and active users are communicated with and that there is an opt-out function.
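The corrective actions above (a scripted restore, formal approval, and a backup before any production change) could be enforced with a pre-flight check along the following lines. This is only an illustrative sketch, not UpGuard's actual tooling: `RestoreRequest`, `preflight_check`, the database names, and the one-hour backup freshness threshold are all hypothetical assumptions introduced for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Optional


class RestoreRefused(Exception):
    """Raised when a restore request fails a pre-flight check."""


@dataclass(frozen=True)
class RestoreRequest:
    # All fields are hypothetical; a real change-control system will differ.
    source_image: str            # backup image to restore from
    target_db: str               # database that will be overwritten
    target_environment: str      # classification on the change ticket
    approved_by: Optional[str]   # reviewer sign-off, if any
    confirm_target: str          # operator re-types the target name


def preflight_check(
    req: RestoreRequest,
    production_dbs: FrozenSet[str],
    last_backup_at: Optional[datetime],
    now: datetime,
    max_backup_age: timedelta = timedelta(hours=1),  # assumed threshold
) -> None:
    """Raise RestoreRefused unless the restore is safe to run."""
    # Guard against typos: the operator must re-type the exact target.
    if req.confirm_target != req.target_db:
        raise RestoreRefused("confirmation does not match target database")

    is_production = req.target_db in production_dbs

    # A change labelled non-production must never point at production.
    if is_production and req.target_environment != "production":
        raise RestoreRefused("target is production but change is not classified as such")

    if not is_production:
        return  # non-production targets need no further checks in this sketch

    # Production changes require review and approval.
    if not req.approved_by:
        raise RestoreRefused("production restore requires an approver")

    # Production changes require a recent backup to fall back on.
    if last_backup_at is None or now - last_backup_at > max_backup_age:
        raise RestoreRefused("no sufficiently recent backup of the target")
```

A restore script would call `preflight_check` before touching any database and abort on `RestoreRefused`. Under these assumptions, a request labelled "test" whose target is actually a production database is rejected before anything is overwritten.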