UserVoice experienced a major incident on October 19, 2018, affecting Web Portal (subdomain), Admin Console, and 1 more component, lasting 5h 32m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Oct 19, 2018, 05:05 PM UTC
We are currently investigating 503 errors on UserVoice web portals and the admin console. This also affects the API and widgets.
- monitoring Oct 19, 2018, 07:07 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Oct 19, 2018, 10:38 PM UTC
This incident has been resolved.
- postmortem Oct 23, 2018, 06:11 PM UTC
On October 19th between 10:00 and 11:30 PDT, UserVoice experienced two infrastructure outages of approximately 10 minutes each, which made the service unavailable site-wide.

**Business Impact**

During the outages, end users and admins would have been unable to load or interact with UserVoice sites or widgets. Email would have been delayed, but no emails were lost.

**Root Cause**

UserVoice uses an in-memory data-store cluster (Redis) to handle asynchronous job management and transient data storage. A recent change to one of the libraries that use this service caused a very sudden increase in its usage. The sudden usage increase caused a system failure and prevented failover to like-sized standby services.

**What we are Doing to Prevent This**

* Increased the size of our Redis cluster and added alerting so that we can detect usage spikes more quickly
* Fixed the library that wasn’t properly interacting with Redis
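
As an illustration of the kind of usage-spike alerting mentioned above, here is a minimal sketch in Python. It is not UserVoice's actual alerting; the function name `detect_spikes`, the window size, the threshold ratio, and the sample values are all assumptions made for this example. It flags any sample that exceeds a multiple of the average of the samples just before it.

```python
from collections import deque
from statistics import mean


def detect_spikes(samples, window=5, ratio=2.0):
    """Return the indexes of samples that exceed `ratio` times the mean
    of the preceding `window` samples (hypothetical alert rule)."""
    history = deque(maxlen=window)  # rolling window of recent samples
    spikes = []
    for i, value in enumerate(samples):
        # Only evaluate once a full window of history is available.
        if len(history) == window and value > ratio * mean(history):
            spikes.append(i)
        history.append(value)
    return spikes


# Hypothetical memory-usage readings in MB, invented for this example.
used_memory_mb = [100, 102, 101, 99, 103, 104, 260, 270, 105]
print(detect_spikes(used_memory_mb))
```

In a real deployment the samples might come from a metric such as the `used_memory` field reported by the Redis `INFO` command, polled at a fixed interval. That is an assumption for this sketch, not something the report states.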