Benevity incident

Spark unavailable throwing (503)

Benevity experienced a critical incident on March 6, 2025 affecting Donate and Volunteer Core Services, lasting 9m. The incident has been resolved; the full update timeline is below.

Started: Mar 06, 2025, 12:23 AM UTC
Resolved: Mar 06, 2025, 12:33 AM UTC
Duration: 9m
Detected by Pingoru: Mar 06, 2025, 12:23 AM UTC

Affected components

Donate and Volunteer Core Services

Update timeline

identified Mar 06, 2025, 12:23 AM UTC

A brief service outage occurred.
identified Mar 06, 2025, 12:27 AM UTC

Systems are operational again. We are continuing to investigate.
resolved Mar 06, 2025, 12:33 AM UTC

This incident has been resolved
postmortem Mar 31, 2025, 11:39 PM UTC

## Summary

On March 5, 2025, the Benevity Donate and Volunteer service was unavailable for approximately 5 minutes starting at 17:15 MT. During this time, some users accessing the Donate and Volunteer service were presented with an error screen.

## Impact

The incident lasted for 5 minutes. During the outage, some users were presented with an error screen.

## Root Cause

As part of improving Benevity’s security posture, modifications were made to the configuration of the load balancer of a service running at the edge of the Benevity network. Those configuration changes caused the service to recycle its availability to the load balancer. The process of deregistering and re-registering against the load balancer takes approximately 5 minutes. Once the service was re-registered, access to Donate and Volunteer was restored.

When reviewing the change prior to applying it, it was not clear to the team that the change would trigger a recycle process in the vendor product.

## Future Mitigation

We have shared how this particular service behaves with the team, so that it is more aware of this specific behaviour. This change was validated in our non-production environments, but the recycle was not observed by the team applying the change in those environments. We will be implementing some additional monitoring in our non-production environments to alert on this kind of failure in the future.

## Timeline of Events

* 17:15 MT - Configuration team applied the change
* 17:15 MT - Alerting indicated failure conditions
* 17:20 MT - Service recycle process completed, and service restored
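
As an illustration of the kind of non-production monitoring mentioned under Future Mitigation, the following is a minimal sketch that groups failing health-probe results into outage windows and decides whether to alert. It is not Benevity's actual tooling: the names `Probe`, `find_outage_windows`, and `should_alert`, the once-a-minute probe cadence, and the alert threshold are all hypothetical and chosen only for the example.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Hypothetical probe record: (time the probe ran, HTTP status code it received).
Probe = Tuple[datetime, int]
Window = Tuple[datetime, datetime]


def find_outage_windows(probes: Iterable[Probe]) -> List[Window]:
    """Group consecutive 5xx probe results into (first_failure, first_recovery) windows.

    An outage still in progress at the end of the data is closed at its
    last failing probe, since no recovery has been seen yet.
    """
    windows: List[Window] = []
    start = None
    last_failure = None
    for when, status in sorted(probes):
        if 500 <= status <= 599:
            if start is None:
                start = when  # first failing probe opens a window
            last_failure = when
        elif start is not None:
            windows.append((start, when))  # first healthy probe closes it
            start = None
    if start is not None:
        windows.append((start, last_failure))
    return windows


def should_alert(windows: Iterable[Window], threshold: timedelta) -> bool:
    """Return True if any outage window lasted at least `threshold`."""
    return any(end - start >= threshold for start, end in windows)


# Illustrative data only: one probe per minute, with 503s from 17:15 to 17:19
# and the first healthy response at 17:20, mirroring the timeline above.
base = datetime(2025, 3, 5, 17, 13)
sample_probes = [
    (base + timedelta(minutes=i), 503 if 2 <= i <= 6 else 200)
    for i in range(10)
]
sample_windows = find_outage_windows(sample_probes)
```

Running a check like this against a non-production environment while a configuration change is applied would surface a recycle of several minutes even if nobody is watching the service at that moment. The threshold passed to `should_alert` is a tuning choice; a value shorter than the roughly 5-minute recycle described above would be needed for this failure mode to trigger an alert.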