CommentSold incident

Apps loading slowly - admin interface unavailable

CommentSold experienced a critical incident on March 31, 2021 affecting CommentSold /Admin and CommentSold App API, lasting 56m. The incident has been resolved; the full update timeline is below.

Started: Mar 31, 2021, 01:14 AM UTC
Resolved: Mar 31, 2021, 02:10 AM UTC
Duration: 56m
Detected by Pingoru: Mar 31, 2021, 01:14 AM UTC

Affected components

CommentSold /Admin
CommentSold App API

Update timeline

investigating Mar 31, 2021, 01:14 AM UTC

Investigating an issue causing the apps to load slowly and the backend admin interface to be unavailable. We will report back ASAP; our entire team is currently investigating.

investigating Mar 31, 2021, 01:41 AM UTC

We are continuing to investigate. Database connections have stabilized, but latency is still high. We are thoroughly investigating and working on deploying changes to alleviate the issues as quickly as possible.

monitoring Mar 31, 2021, 01:47 AM UTC

We’ve applied fixes to the service and have brought it back to regular latency and response times. We are currently monitoring.

resolved Mar 31, 2021, 02:10 AM UTC

The issue should be completely resolved now. Queues are catching back up, so comments are being processed; expect them to be fully caught up within the next 2 minutes. All comments will be read.

postmortem Mar 31, 2021, 03:58 PM UTC

Yesterday at approximately 8pm CDT, the CommentSold platform encountered a performance issue. This manifested as very slow load times for the admin portal, mobile apps, and account page. Response times began to improve gradually at 8:30pm CDT, and by 8:45pm CDT the issue was largely resolved, with some lingering effects until 9:10pm CDT.

The investigation of this event is still underway, but initial findings indicate that the issue was caused by hitting a limit on how quickly connections could be established to our primary database. Our existing monitoring had been focused on the load on the database itself, which remained normal throughout the performance event.

We're taking several steps to prevent this issue from happening again. First, we're adding monitoring to our database to alert us to potential issues like this before they become problematic. Second, we're making changes to move connections from our primary database server to replica database servers. Finally, we'll be changing how we manage these database connections to ensure that we can scale well beyond the amount of traffic we currently handle.
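
The postmortem does not say how connection management will change. One common approach to a limit on connection establishment rate is connection pooling: a small set of connections is opened once and reused across requests. The following is a minimal sketch of that idea only, not CommentSold's implementation. `ConnectionPool`, `FakeConnection`, and `handle_request` are hypothetical names, and the fake connection stands in for a real database driver.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: connections are opened lazily and reused."""

    def __init__(self, connect, max_size):
        self._connect = connect                 # callable that opens a new connection
        self._idle = queue.LifoQueue()          # connections waiting to be reused
        self._capacity = threading.BoundedSemaphore(max_size)
        self._lock = threading.Lock()
        self.opened = 0                         # how many connections were established

    @contextmanager
    def connection(self):
        # Block while max_size connections are already checked out.
        self._capacity.acquire()
        try:
            try:
                conn = self._idle.get_nowait()  # reuse an idle connection if one exists
            except queue.Empty:
                conn = self._connect()          # otherwise establish a new one
                with self._lock:
                    self.opened += 1
            try:
                yield conn
            finally:
                # A real pool would discard broken connections here; this sketch
                # always returns the connection for reuse.
                self._idle.put(conn)
        finally:
            self._capacity.release()


class FakeConnection:
    """Stand-in for a real database connection (assumption for illustration)."""

    def query(self, sql):
        return f"ok: {sql}"


pool = ConnectionPool(FakeConnection, max_size=5)


def handle_request(i):
    # Each request borrows a connection instead of opening its own.
    with pool.connection() as conn:
        return conn.query(f"SELECT {i}")


results = [handle_request(i) for i in range(100)]
print(f"requests served: {len(results)}, connections opened: {pool.opened}")
```

Run sequentially as above, all 100 requests reuse a single connection. Under concurrent load the pool would open at most `max_size` connections, so the rate of new connections stays bounded regardless of request volume.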