Neon Fundraise incident

Outage

Critical, Resolved

Neon Fundraise experienced a critical incident on February 21, 2021, affecting Sprocket API, Donations, and 1 more component, and lasting 6h 37m. The incident has been resolved; the full update timeline is below.

Started
Feb 21, 2021, 10:19 PM UTC
Resolved
Feb 22, 2021, 04:57 AM UTC
Duration
6h 37m
Detected by Pingoru
Feb 21, 2021, 10:19 PM UTC

Affected components

Sprocket API, Donations, DNS (status.aws.amazon.com), Rallybound Payment Service, Management interfaces, CDN (status.aws.amazon.com), Neon Pay, Site Builder, Endurance Challenges Strava Integration, Registrations

Update timeline

  1. monitoring Feb 21, 2021, 10:19 PM UTC

    We're monitoring a fix we've deployed to address the second outage that occurred this weekend.

  2. investigating Feb 21, 2021, 11:11 PM UTC

    We are still investigating the issue.

  3. monitoring Feb 22, 2021, 12:56 AM UTC

    We stabilized the issue, which was caused by custom code deployed for a single client's instance. We are planning maintenance tonight to sandbox the instances so that custom code on one instance does not affect other instances.

  4. resolved Feb 22, 2021, 04:57 AM UTC

    Our fix was deployed and tested.

  5. postmortem Feb 22, 2021, 05:56 PM UTC

    Timeline

    On Friday, February 19th at 3:35 PM Eastern Time for 11 minutes, and then again on Sunday, February 21st at 5:07 PM ET for 7 minutes, our services were heavily disrupted. The outage was related to front-end template code that caused a recursive chain of exceptions to be thrown, which in turn caused our application to restart continuously. As part of our restoration process, we restarted our applications several times, which caused our sites to be a bit slower than normal later that Sunday, at 5:44 PM ET, 6:00 PM ET, and 6:08 PM ET.

    Corrective and Preventative Measures

    After taking the offending code offline and restabilizing our platform, we implemented an update that identifies similar problematic code and prevents it from creating disruptions in the future. The update was successfully deployed on Sunday at 11:30 PM ET.

    In Summary

    Our platform was mostly offline for a total of about 18 minutes this past weekend. Our development team took immediate action to restore service with a temporary measure, and then, after analysis and testing, implemented a permanent fix to ensure that this does not happen again. We are sorry for any inconvenience this may have caused. As always, please reach out to us (https://helpdesk.rallybound.com/) with any questions or concerns.
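The postmortem does not publish the actual fix, so the following is only an illustrative sketch of the general idea it describes: template code that recurses without bound is stopped by a depth limit, and the resulting error is contained to the one client instance that owns the template. Every name here (`render`, `render_for_tenant`, `MAX_INCLUDE_DEPTH`, the `{{ include name }}` syntax) is hypothetical and not taken from the vendor's platform.

```python
import re
from typing import Dict

# Hypothetical limit on nested includes; the real platform's limit is unknown.
MAX_INCLUDE_DEPTH = 10
# Hypothetical include directive, e.g. "{{ include footer }}".
INCLUDE = re.compile(r"\{\{\s*include\s+(\w+)\s*\}\}")
FALLBACK = "<p>Temporarily unavailable</p>"


class TemplateError(Exception):
    """Raised when a template cannot be rendered safely."""


def render(name: str, templates: Dict[str, str], depth: int = 0) -> str:
    """Expand include directives, refusing to recurse past the depth limit."""
    if depth > MAX_INCLUDE_DEPTH:
        raise TemplateError(f"include depth exceeded while rendering {name!r}")
    if name not in templates:
        raise TemplateError(f"unknown template {name!r}")
    return INCLUDE.sub(
        lambda match: render(match.group(1), templates, depth + 1),
        templates[name],
    )


def render_for_tenant(
    tenant: str, name: str, templates_by_tenant: Dict[str, Dict[str, str]]
) -> str:
    """Render one tenant's page; a bad template only affects that tenant."""
    try:
        return render(name, templates_by_tenant.get(tenant, {}))
    except TemplateError:
        # Contain the failure instead of letting it crash the shared process.
        return FALLBACK
```

In this sketch, a tenant whose page includes itself receives the fallback page, while other tenants served by the same process continue to render normally.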