Altmetric incident

Explorer delay in data processing

Major Resolved

Altmetric experienced a major incident on October 21, 2021 affecting Altmetric Explorer and Data Processing, lasting 21d 18h. The incident has been resolved; the full update timeline is below.

Started
Oct 21, 2021, 03:47 PM UTC
Resolved
Nov 12, 2021, 10:31 AM UTC
Duration
21d 18h
Detected by Pingoru
Oct 21, 2021, 03:47 PM UTC

Affected components

Altmetric Explorer, Data Processing

Update timeline

  1. investigating Oct 21, 2021, 03:47 PM UTC

    We are currently investigating the root cause of an issue which is causing the following:

    - New mentions have not been visible in the Altmetric Explorer since 23:59 on 19th Oct.
    - Research outputs that have not previously been mentioned have not appeared in the Altmetric Explorer since 23:59 on 19th Oct.
    - The Altmetric score on Publisher badges may not match the score within the Altmetric Explorer.
    - Altmetric searches may not be as performant as customers would normally experience.

    The Explorer database is usually updated on a nightly basis, which means that one update is currently missing for customers.

  2. identified Oct 21, 2021, 07:13 PM UTC

    We have identified the root cause of the issue, which relates to a backend service responsible for preparing the nightly snapshot used to populate the Altmetric Explorer database for our Explorer application and API. Our teams are working to identify the quickest way to restore full access to the latest data while safeguarding the availability of the service. We would like to apologise for the inconvenience to our users and will provide a further update before 0900UTC/1000BST 22nd Oct.

  3. identified Oct 22, 2021, 08:36 AM UTC

    Our major incident team have reconvened this morning to review the overnight processing progress and are assessing the likely recovery time for Altmetric Explorer data. The impact on our Explorer services remains the same and we expect to provide a further update before 12:00UTC/1400BST 22nd Oct.

  4. identified Oct 22, 2021, 12:25 PM UTC

    Our major incident team has identified a backlog of processing which is not expected to clear before early next week. Non-essential processing has been paused to allow the Altmetric Explorer processing to take priority. Incremental updates will be posted here as they become available, ahead of a full update before 1700UTC/1800BST on Monday 25th October. We're very sorry to any users who are impacted. The Details Pages and Details Page API remain available throughout.

  5. identified Oct 25, 2021, 12:46 PM UTC

    The system is continuing to work through the processing backlog and at current rates we hope to be back up to date within the next 24 hours. This means that the vast majority of data will match between the Details Pages and the Explorer once our daily snapshot process has completed on Wednesday 27th October. We will update again tomorrow morning (Tuesday 26th).

  6. identified Oct 26, 2021, 10:29 AM UTC

    The system has successfully worked through the backlog of data, and all data has been synced apart from a small 2-hour period that will require some manual curation to re-sync. We are setting out a plan to conduct this work and will continue to keep this incident updated with progress.

  7. identified Oct 27, 2021, 05:51 PM UTC

    Our incident team have continued to work on validating the accuracy of our data following the incident. We identified:

    - Up to a 2-hour period from 25th Oct that has now been resolved
    - Up to a 7-hour period from 20th Oct

    The outstanding data from the 20th requires our teams to rebuild the data in an alternative environment for testing prior to repeating in Production. This is a slow process of recreating and manually checking, which is likely to take us at least 2 weeks, and we'll continue to report our progress here. We would like to apologise again for any impact that this incident has had on our users.

  8. identified Nov 03, 2021, 04:15 PM UTC

    Our incident management team have completed the recovery of the missing data and are now synchronising this data across our databases. Because our data is continuously changing, this has to be done slowly; however, we have reduced our recovery time objective. We had thought that it could take up to another week to fully recover and test all aspects of the data, but we've revised this estimate to the end of this current week. Our teams will continue to work until the incident is resolved and if anything changes in the meantime, we shall update this page. Thank you for being patient while we ensure the stability and integrity of our data.

  9. identified Nov 05, 2021, 04:22 PM UTC

    We can confirm that the data lost from the original incident has now been recovered and fully synchronised across our databases. Unfortunately, the recovery process had an unexpected impact on our news processing, and as a result we are still behind in processing news mentions from November 2nd onwards. The team are working to catch up and we will have another update on Monday 8th November. We thank you for your continued patience while we ensure the stability and integrity of our data.

  10. identified Nov 08, 2021, 06:55 PM UTC

    We are nearing resolution for the issues currently impacting news mention processing and expect to be able to resume updates as early as tomorrow morning. We expect the current news mention backlog to take some further time to process once the stream is up and running. We will post an update tomorrow with our progress at that time.

  11. identified Nov 10, 2021, 04:52 PM UTC

    Our incident management team have now reached the final stages of this incident and are processing the current news mention backlog. We shall be doing this over the course of the next couple of days so that normal operation remains unaffected. The next update on this incident will be by 1700 UTC/UK on Friday 12th November.

  12. resolved Nov 12, 2021, 10:31 AM UTC

    All customer-facing impacts have now been resolved. The incident team will now review the incident, identifying lessons learned and implementing improvements to reduce the likelihood of recurrence and speed up recovery times.