Benevity experienced a critical incident on March 20, 2025, affecting Donate and Volunteer Core Services, Benevity Platform, and one other component, lasting 1h 5m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Mar 20, 2025, 01:01 AM UTC
We have received reports of clients receiving 504 Gateway Timeout errors. Teams are currently investigating the cause and working on a resolution.
- monitoring Mar 20, 2025, 01:24 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved Mar 20, 2025, 02:07 AM UTC
This incident has been resolved.
- postmortem Apr 21, 2025, 03:38 PM UTC
## **Summary**

On March 19, 2025, from 6:30pm MT to 7:15pm MT, and on March 20, 2025, from 10:20am MT to 10:30am MT, Spark client users experienced significant degradation when attempting to log in to their Spark site and likely saw error messages when attempting any operation. Resource contention on the platform instances caused severe performance issues: client requests to the platform could not complete and timed out, leaving users unable to perform actions or access site pages.

## **Impact**

For 45 minutes on March 19, 2025, and for 10 minutes on March 20, 2025, all users attempting to access their Spark site either encountered an error message when attempting to log in or experienced significant degradation when trying to access pages within the site.

## **Root Cause**

Although Benevity’s clients experienced the outage and degradation on their Spark client sites, investigation determined that the root cause originated in the Benevity Platform, the backend API service used by Spark.

Within the Benevity Platform codebase, the team identified a non-performant section of code related to the generation of donation receipt PDFs. Under normal circumstances, this code performed within reasonable response-time expectations and returned results within seconds. In certain circumstances and with specific data sets, however, its performance degraded significantly, which in turn affected subsequent requests to the platform from Spark client sites. Because the slowdown did not come from a single large, slow database query, it took the technical team longer to narrow down the specific section of code.

Once the problematic section of code was identified, the team implemented a fix that reduced its complexity and substantially improved its performance.

## **Future Mitigation**

* The complex and inefficient piece of code was updated, which substantially improved the overall performance of receipt generation for all users on the platform.
* Additional improvements to non-performant areas of the codebase are being scoped as part of Benevity’s Quality Assurance program.

## **Timeline of Events**

March 19, 2025

* 6:30pm MT - Initial alert received
* 6:56pm MT - Resource contention in platform identified
* 7:15pm MT - Platform instances restarted
* 7:15pm MT - Spark fully operational

March 20, 2025

* 10:20am MT - Initial alert received
* 10:30am MT - Platform instances restarted
* 10:30am MT - Spark fully operational with mitigations in place

March 27, 2025

* 1:16pm MT - Issue reproduced in non-production environment
* 3:30pm MT - Team begins working on the code fix

March 28, 2025

* 10:32am MT - Code fix completed

March 31, 2025

* 8:52am MT - Code fix validated in non-production environment

April 1, 2025

* 7:49am MT - Code fix successfully deployed to production
* 8:04am MT - Code fix successfully validated in production
* 8:05am MT - Remediations have been completed, all systems are operational
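
The status updates above are stamped in UTC, while the postmortem uses MT. As a reading aid, the following minimal sketch converts the two impact windows to UTC and recomputes their durations. It assumes "MT" means Mountain Daylight Time (UTC-6), which the postmortem does not state explicitly; the `MDT` constant and the `window` helper are illustrative names, not part of any Benevity tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "MT" in the postmortem is Mountain Daylight Time (UTC-6),
# which was in effect in late March 2025.
MDT = timezone(timedelta(hours=-6), "MDT")


def window(start, end):
    """Return (start in UTC, end in UTC, duration in whole minutes)."""
    minutes = int((end - start).total_seconds() // 60)
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc), minutes


# Impact windows as stated in the Summary section.
first = window(
    datetime(2025, 3, 19, 18, 30, tzinfo=MDT),
    datetime(2025, 3, 19, 19, 15, tzinfo=MDT),
)
second = window(
    datetime(2025, 3, 20, 10, 20, tzinfo=MDT),
    datetime(2025, 3, 20, 10, 30, tzinfo=MDT),
)

for start_utc, end_utc, minutes in (first, second):
    print(f"{start_utc:%b %d %H:%M} UTC to {end_utc:%b %d %H:%M} UTC ({minutes} min)")
```

Under that assumption, the March 19 evening window falls on March 20 in UTC, which would explain why the status page dates the incident March 20 while the postmortem dates the first occurrence March 19.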