Retention Science incident

Database server out of temporary disk space causing certain sites' recommendations to fail

Retention Science experienced a major incident on April 19, 2024, affecting the Cortex Application (Main Dashboard) and the Recommendations API and lasting 2d 21h. The incident has been resolved; the full update timeline is below.

Started: Apr 19, 2024, 09:20 PM UTC
Resolved: Apr 22, 2024, 06:25 PM UTC
Duration: 2d 21h
Detected by Pingoru: Apr 19, 2024, 09:20 PM UTC

Affected components

Cortex Application (Main Dashboard)
Recommendations API

Update timeline

investigating Apr 19, 2024, 09:20 PM UTC

We are currently investigating. It appears that one of our databases has run out of the temporary disk space it uses to unload large tables for our machine learning algorithm. Not all sites are affected; the issue seems to primarily affect larger sites (many millions of users). We are working to remediate this and are also trying to determine why it started suddenly, given that we have changed little on the database server side. We will post further findings here as we have them.

investigating Apr 19, 2024, 10:32 PM UTC

We are continuing to investigate. The issue seems to have started on April 14th, when we applied an AWS-required MySQL 5.7 to 8 upgrade to our Subscription service database. The upgrade appears to have caused unforeseen performance issues when running multiple sites' machine learning jobs. We will be increasing the database instance size to sidestep the space issue temporarily. Our hypothesis is that this will buy us some time and (hopefully) allow our big jobs to continue running. Meanwhile, we will investigate how to make disk usage more efficient or resolve the issue altogether.

identified Apr 19, 2024, 10:32 PM UTC

The issue has been identified and a fix is being implemented.

resolved Apr 22, 2024, 06:25 PM UTC

We found a configuration setting that did not carry over from the old version of the database to the new one. We added this setting on Friday around 6pm Pacific, which fixed the problem with our big data operations on our Subscription service database for all clients. We monitored over the weekend and saw no further errors. This issue has been resolved. We have taken steps to make sure all of our databases have this configuration and will be monitoring for similar issues with other databases going forward.
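
A check for settings that fail to carry over between database versions can be sketched as below. This is only an illustration, not the tooling actually used during the incident: the function name, the setting names, and the values are all hypothetical, and the update above does not say which setting was missing. The sketch assumes the settings of both servers have already been exported into plain dictionaries.

```python
def find_config_drift(old_settings, new_settings):
    """Return settings from old_settings that are absent or different in new_settings.

    Maps each drifting setting name to a tuple (old_value, new_value);
    new_value is None when the setting is missing on the new server.
    """
    drift = {}
    for name, old_value in old_settings.items():
        new_value = new_settings.get(name)
        if new_value != old_value:
            drift[name] = (old_value, new_value)
    return drift


# Hypothetical exports, for illustration only. These names are examples
# and are NOT claimed to be the setting involved in this incident.
old_server = {
    "tmp_table_size": "268435456",
    "max_heap_table_size": "268435456",
    "max_connections": "500",
}
new_server = {
    "tmp_table_size": "16777216",
    "max_connections": "500",
}

drift = find_config_drift(old_server, new_server)
for name, (was, now) in sorted(drift.items()):
    print(f"{name}: old={was!r} new={now!r}")
```

Running the same comparison against every database after an upgrade would flag settings that were silently reset or dropped, before they surface as job failures.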