UserVoice incident

503 errors in the UserVoice admin console


UserVoice experienced a major incident on October 19, 2018, affecting the Web Portal (subdomain), Admin Console, UserVoice API, Helpdesk API, and Widgets, lasting 5h 32m. The incident has been resolved; the full update timeline is below.

Started
Oct 19, 2018, 05:05 PM UTC
Resolved
Oct 19, 2018, 10:38 PM UTC
Duration
5h 32m
Detected by Pingoru
Oct 19, 2018, 05:05 PM UTC

Affected components

Web Portal (subdomain), Admin Console, UserVoice API, Helpdesk API, Widgets

Update timeline

  1. investigating Oct 19, 2018, 05:05 PM UTC

    We are currently investigating 503 errors on UserVoice web portals and the admin console. This also affects the API and widgets.

  2. monitoring Oct 19, 2018, 07:07 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Oct 19, 2018, 10:38 PM UTC

    This incident has been resolved.

  4. postmortem Oct 23, 2018, 06:11 PM UTC

    On October 19th, between 10:00 and 11:30 PDT, UserVoice experienced two infrastructure outages of approximately 10 minutes each, which caused site-wide unavailability.

    **Business Impact**

    During the outages, end users and admins would have been unable to load or interact with UserVoice sites or widgets. Email would have been delayed, but no emails were lost.

    **Root Cause**

    UserVoice uses an in-memory data-store cluster (Redis) to handle asynchronous job management and transient data storage. A recent change to one of the libraries that use this service caused a very sudden increase in its usage. That increase caused a system failure and prevented failover to like-sized standby services.

    **What We Are Doing to Prevent This**

    * Increased the size of our Redis cluster and added alerting so that we can detect usage spikes more quickly
    * Fixed the library that wasn’t properly interacting with Redis
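
The postmortem mentions added alerting for usage spikes but does not describe how it works. As an illustration only, the following is a minimal sketch of one common approach: comparing each new usage reading against a rolling average of recent readings. The class name `SpikeDetector`, the `window` and `ratio` parameters, and the sample values are all hypothetical and are not taken from UserVoice's systems. In practice the readings might come from a metric such as the `used_memory` field reported by the Redis `INFO` command, but the source does not say which metric is monitored.

```python
from collections import deque
from statistics import mean


class SpikeDetector:
    """Hypothetical helper: flags a reading that exceeds a multiple of the recent average."""

    def __init__(self, window: int = 5, ratio: float = 2.0) -> None:
        self.samples = deque(maxlen=window)  # most recent usage readings
        self.ratio = ratio                   # assumed alert threshold, not from the source

    def observe(self, used_bytes: float) -> bool:
        """Record a reading and return True if it counts as a spike."""
        is_spike = False
        # Only judge a reading once a full window of history exists.
        if len(self.samples) == self.samples.maxlen:
            baseline = mean(self.samples)
            is_spike = baseline > 0 and used_bytes > self.ratio * baseline
        self.samples.append(used_bytes)
        return is_spike


# Example with made-up readings: steady usage, then a sudden jump.
detector = SpikeDetector(window=3, ratio=2.0)
results = [detector.observe(x) for x in [100, 100, 100, 110, 400]]
```

A ratio against a rolling baseline reacts to sudden changes rather than to a fixed ceiling, which matches the "very sudden increase" described in the root cause; a real deployment would likely pair it with an absolute capacity limit as well.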