Benevity incident

Small percentage of users were encountering a 500 error in Spark

Minor Resolved

Benevity experienced a minor incident on October 16, 2024 affecting Donate and Volunteer Core Services, lasting 56m. The incident has been resolved; the full update timeline is below.

Started
Oct 16, 2024, 05:39 PM UTC
Resolved
Oct 16, 2024, 06:36 PM UTC
Duration
56m
Detected by Pingoru
Oct 16, 2024, 05:39 PM UTC

Affected components

Donate and Volunteer Core Services

Update timeline

  1. monitoring Oct 16, 2024, 05:39 PM UTC

    We identified an issue in which a small percentage of users encountered a 500 error when attempting to complete a transaction in Spark. Only the ability to create a transaction was impacted; transactions that completed successfully were not affected. The issue was identified immediately, and a fix has been tested and implemented and has resolved it. We are continuing to monitor in case further issues arise.

  2. resolved Oct 16, 2024, 06:36 PM UTC

    This incident has been resolved.

  3. postmortem Nov 01, 2024, 09:46 PM UTC

    ## Summary

    On October 11, 2024, a change in the match amount estimation code introduced an issue that caused some donors to encounter a 500 error when attempting to make a donation. This error was promptly identified on October 15, 2024, through our automated monitoring system and was resolved with a code update and deployment on the morning of October 16.

    ## Impact

    Up to 37 donors experienced a total of 53 error messages while trying to donate. Some donors attempted to refresh the error page or retry their donation, which led to more than one error per donor on average. Additionally, processing for some payroll periods was briefly delayed due to the bug. However, no incorrect payroll transactions were processed, and manual intervention allowed for a quick and effective recovery.

    ## Root Cause

    The code update assumed that the backend API would always return a non-null value. However, under certain conditions, it could return a null value, which triggered a 500 error when estimating the match amount for these 37 donors. This resulted in an error screen being displayed to the affected users.

    ## Future Mitigation

    While our dashboard and associated alerts detected 8 initial errors and prompted our investigation, they did not capture all 53 occurrences. We are enhancing our error monitoring and alerting processes to improve the detection and notification of all 500 errors. Additionally, we are increasing our automated testing around null-safety to help prevent similar issues from reaching production in the future.

    ## Timeline of Events

    * Oct 15, 2024 02:00 MDT - Error logs detected, teams are alerted and begin working on fix.
    * Oct 15, 2024 10:43 MDT - Initial code fix is merged. Team identifies further work required.
    * Oct 15, 2024 15:30 MDT - Follow-up work is merged.
    * Oct 16, 2024 01:00 MDT - Team identifies a separate code fix is required for Challenges.
    * Oct 16, 2024 03:26 MDT - Challenges code fix is merged.
    * Oct 16, 2024 06:21 MDT - Release is deployed to production and issue is resolved.
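
The root cause described in the postmortem (code that assumed a backend value would never be null) can be illustrated with a minimal sketch. This is not Benevity's actual code: the names `MatchEstimate` and `estimate_match`, and the idea that the nullable value is a match ratio, are assumptions made purely for illustration. The sketch shows one way an estimation step might handle a missing value explicitly instead of raising an unhandled exception that surfaces as a 500 error.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class MatchEstimate:
    """Hypothetical result type: amount is None when no estimate is available."""
    amount: Optional[Decimal]
    available: bool


def estimate_match(donation: Decimal, match_ratio: Optional[Decimal]) -> MatchEstimate:
    """Estimate a match amount, tolerating a null ratio from the backend.

    `match_ratio` stands in for whatever nullable value the backend API
    returned; the real field is not named in the postmortem.
    """
    if match_ratio is None:
        # Null from the backend: report "no estimate" rather than raising.
        return MatchEstimate(amount=None, available=False)
    return MatchEstimate(amount=donation * match_ratio, available=True)
```

Under these assumptions, a caller would check `available` before displaying an estimate, so a null backend value degrades to a missing estimate rather than an error screen. A null-safety test of the kind mentioned under Future Mitigation would then simply call `estimate_match` with `None` and assert that no exception is raised.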