CommentSold incident
Apps loading slowly - admin interface unavailable
CommentSold experienced a critical incident on March 31, 2021, affecting CommentSold /Admin and CommentSold App API and lasting 56 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Mar 31, 2021, 01:14 AM UTC
We are investigating an issue with the apps loading slowly and the backend admin interface being unavailable. Our entire team is currently investigating, and we will report back ASAP.
- investigating Mar 31, 2021, 01:41 AM UTC
We are continuing to investigate. The database connections have stabilized, but latency is still high. We are working on deploying changes to alleviate the issues as quickly as possible.
- monitoring Mar 31, 2021, 01:47 AM UTC
We’ve applied fixes to the service, and latency and response times are back to normal. We’re currently monitoring.
- resolved Mar 31, 2021, 02:10 AM UTC
This should be completely resolved now. Queues are catching back up, so comments are being processed; expect them to be fully caught up within the next 2 minutes. All comments will be read.
- postmortem Mar 31, 2021, 03:58 PM UTC
Yesterday at approximately 8pm CDT the CommentSold platform encountered a performance issue. This manifested as very slow load times for the admin portal, mobile apps, and account page. Response times gradually improved starting at 8:30pm CDT, and by 8:45pm CDT the issue was largely resolved, with some lingering effects until 9:10pm CDT.
The investigation of this event is still underway, but initial findings indicate that the issue was caused by hitting a limit on how quickly connections could be established to our primary database. The monitoring we had in place was focused on the load on the database itself, which remained normal throughout the performance event.
We're taking several steps to prevent this issue from happening again. First, we're adding monitoring to our database to alert us to potential issues like this before they become problematic. Second, we're making changes to move connections from our primary to replica database servers. Finally, we'll be changing how we manage these database connections to ensure that we can scale well beyond the amount of traffic we currently handle.
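The postmortem does not say how the connection-management change was implemented. As a rough illustration only, one common way to stay under a limit on how quickly connections can be established is to open a fixed set of connections once and reuse them across requests. The sketch below assumes that approach. `ConnectionPool`, `connect_stand_in`, and `run_requests` are hypothetical names, and an in-memory SQLite database stands in for the real primary database. The `opened` counter is the kind of number that connection-focused monitoring could track alongside database load.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are opened once and then reused."""

    def __init__(self, connect, size):
        self._connect = connect
        self._idle = queue.Queue(maxsize=size)
        self.opened = 0  # total connections ever established (a metric to monitor)
        for _ in range(size):
            self._idle.put(self._open())

    def _open(self):
        self.opened += 1
        return self._connect()

    @contextmanager
    def connection(self, timeout=5.0):
        # Wait up to `timeout` seconds for an idle connection instead of
        # opening a new one; raises queue.Empty if none becomes available.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back so later requests can reuse it.
            self._idle.put(conn)


def connect_stand_in():
    # Assumption: stand-in for the real database driver's connect call.
    return sqlite3.connect(":memory:", check_same_thread=False)


def run_requests(pool, n):
    """Simulate n requests that each run one query on a pooled connection."""
    results = []
    for _ in range(n):
        with pool.connection() as conn:
            results.append(conn.execute("SELECT 1").fetchone()[0])
    return results
```

With a pool of three connections, `run_requests(pool, 100)` serves all one hundred simulated requests while `pool.opened` stays at three, so the rate of new connections no longer grows with request traffic. A production setup would more likely rely on the driver's or a proxy's pooling than on a hand-written class like this one.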