FatTail incident

AdBook+ performance issue - affecting Portal users ONLY

Notice, Resolved

FatTail reported a notice-level incident on May 16, 2024 that affected AdBookPORTAL and lasted 2h 45m. The incident has been resolved; the full update timeline is below.

Started
May 16, 2024, 11:35 AM UTC
Resolved
May 16, 2024, 02:21 PM UTC
Duration
2h 45m
Detected by Pingoru
May 16, 2024, 11:35 AM UTC

Affected components

AdBookPORTAL

Update timeline

  1. identified May 16, 2024, 11:35 AM UTC

    We are working to resolve an issue that is preventing users from accessing AdBook Portal. When attempting to log in, users receive a '500 internal server error'. This incident is being worked on with the highest urgency, and we will provide additional updates as soon as they are available.

    NOTE: This is impacting users of xxxxx.adbookportal.com only. Access to adbookxx.fattail.com is not impacted.

    Thank you for your patience while we work to restore performance. FatTail Support.

  2. resolved May 16, 2024, 02:21 PM UTC

    The performance issue has been resolved and users can now access AdBook Portal. A post-mortem of this incident will be provided once full details have been gathered and evaluated by our engineering teams. If you have questions in the interim, please contact us via support.fattail.com.

    Thank you, FatTail Support

  3. postmortem May 24, 2024, 03:42 PM UTC

    ### **Summary of Impact**

    On May 16th, 2024, at approximately 4:39 am ET, FatTail deployed version 3.5.0 of AdBookPortal. During the release there was a deployment issue that caused a loss of connectivity to one of the databases. Consequently, Portal experienced an outage that lasted from 4:39 am ET to 9:59 am ET.

    ### **Root Cause**

    The incident was triggered when a feature that syncs data into our data warehouse was reverted in our production database during deployment. The issue did not surface in lower environments during regression testing because the data warehouse sync had not yet been configured on those environments. At 9:59 am ET, connectivity to the database and its containers was restored. There was no data loss as a result of this incident.

    ### **Next Steps**

    To prevent the issue from recurring, the following will be implemented:

    * We’ve altered our staging infrastructure and deployment pipelines for consistency with production.
    * We will be updating our deployment pipelines to include a manual step to validate the deployment plan before applying it.
    * We have already implemented changes to the site infrastructure which enable us to recover from an event such as this more quickly, in line with our aim of keeping system downtime to an absolute minimum.
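
The postmortem gives its times in ET, while the status timeline is in UTC. As a rough aid for comparing the two, the sketch below converts the postmortem's outage window to UTC and computes its length. It assumes that "ET" on May 16, 2024 means Eastern Daylight Time (UTC-4); the `EDT` constant and the `to_utc` helper are illustrative names, not part of any FatTail tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "ET" on May 16, 2024 is Eastern Daylight Time (UTC-4).
EDT = timezone(timedelta(hours=-4), "EDT")


def to_utc(hour, minute):
    """Convert a May 16, 2024 wall-clock time in EDT to UTC."""
    local = datetime(2024, 5, 16, hour, minute, tzinfo=EDT)
    return local.astimezone(timezone.utc)


outage_start = to_utc(4, 39)  # postmortem: 4:39 am ET
outage_end = to_utc(9, 59)    # postmortem: 9:59 am ET
outage_duration = outage_end - outage_start

print(outage_start.strftime("%H:%M UTC"))  # 08:39 UTC
print(outage_end.strftime("%H:%M UTC"))    # 13:59 UTC
print(outage_duration)                     # 5:20:00
```

Under that assumption, the postmortem's outage window runs from 08:39 to 13:59 UTC (5h 20m), which starts earlier than the 11:35 AM to 02:21 PM UTC window recorded in the status timeline.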