Benevity experienced a major incident on November 25, 2024, lasting —. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Nov 25, 2024, 04:46 AM UTC
On October 17, 2024, users attempting to access their Spark client sites between 15:40 MT and 15:50 MT would have been unable to perform any operations, including login, due to a severe degradation of system performance.
- postmortem Nov 25, 2024, 04:46 AM UTC
## Summary

On October 17, 2024, users attempting to access their Spark client sites between 15:40 MT and 15:50 MT would have been unable to perform any operations, including login, due to a severe degradation of system performance. This degradation was the result of a section of non-performant code in the 'Volunteering Auto-Approval' functionality. A large and sudden increase in traffic from a single Spark client using this functionality caused a cascading failure in the underlying Spark database, leaving it unable to service further user traffic. Benevity's Engineering teams have identified and remediated the issue to prevent future occurrences.

## Impact

For a period of 10 minutes, beginning at 15:40 MT and ending at 15:50 MT, users attempting to access Spark would have experienced degraded performance, including slow page load times and unavailability of key workflows, including login.

From 17:43 MT on October 17 until 14:38 MT on October 18, the 'Auto-Approval' functionality was disabled for a single Spark client site. Users of this client site were still able to submit volunteering track time requests, but those requests were not approved automatically. All submissions made during this time were retroactively approved by Benevity's Services team; however, users may not have seen volunteering rewards appear in their Giving Account until October 25. Benevity communicated directly with the affected client throughout the incident.

## Root Cause

Through investigation, Benevity's Engineering and DevOps teams identified a section of non-performant code in the Volunteering Auto-Approval functionality which performed an expensive and slow query against Spark's underlying backend database.

A large and sudden increase in the number of volunteering track time submissions caused this code to place enough load on the underlying database that it could no longer respond to further requests. As a result, Spark was unable to service any client requests, which impacted all key workflows, including login.

## Future Mitigation

* Benevity's Engineering team removed the identified non-performant data query; this will no longer affect the operation of the Auto-Approval functionality.
* A review of all 'slow queries' should be performed so that they can be proactively removed or optimized.

## Timeline of Events

October 17, 2024

* 15:40 MT - Start of incident
* 15:50 MT - High load subsides; all Spark client sites available
* 16:29 MT - Incident team identifies 'Auto-Approve Volunteer Time' functionality, for a single client, as the cause of the traffic increase
* 17:43 MT - Auto-Approval functionality disabled for the identified client to mitigate future occurrences

October 18, 2024

* 09:30 MT - Investigation determines the underlying issue causing non-optimal performance of the 'Auto-Approval' functionality
* 10:00 MT - Benevity Services team begins manual mitigation of approval requests
* 14:12 MT - Production deployment of Engineering performance enhancement
* 14:38 MT - 'Auto-Approval' functionality re-enabled for the identified client
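
The durations quoted under Impact can be cross-checked against the timeline with a short calculation. The sketch below is illustrative only: the variable names are hypothetical, the timestamps are copied from the timeline, and it assumes Mountain Time can be treated as a fixed local clock across October 17 and 18, 2024.

```python
from datetime import datetime

# Timestamps copied from the timeline. All are Mountain Time (MT) and are
# handled as naive local times (assumption: no clock change in this window).
FMT = "%Y-%m-%d %H:%M"
incident_start = datetime.strptime("2024-10-17 15:40", FMT)
incident_end = datetime.strptime("2024-10-17 15:50", FMT)
auto_approval_disabled = datetime.strptime("2024-10-17 17:43", FMT)
auto_approval_enabled = datetime.strptime("2024-10-18 14:38", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Site-wide degradation window (all Spark client sites).
outage_minutes = minutes_between(incident_start, incident_end)

# Window during which Auto-Approval was disabled for the single client.
disabled_minutes = minutes_between(auto_approval_disabled, auto_approval_enabled)
disabled_hours, disabled_rem = divmod(disabled_minutes, 60)

print(f"Site-wide degradation: {outage_minutes} minutes")
print(f"Auto-Approval disabled: {disabled_hours} h {disabled_rem} min")
```

Under these assumptions, the site-wide degradation comes to 10 minutes, matching the Impact section, and Auto-Approval was disabled for the affected client for 20 hours 55 minutes.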