Abrigo incident
CrowdStrike outage affecting Abrigo applications
Abrigo experienced a critical incident on July 19, 2024, affecting Abrigo ID (login.abrigo.com), BAM+, and 1 more component, lasting 9h 4m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Jul 19, 2024, 05:48 AM UTC
We are currently experiencing multiple service failures across numerous devices in AWS. The root cause appears to be related to general connectivity issues to certain EC2 instance types due to a recent update to the CrowdStrike agent causing a stop error within the Windows operating system. We are actively investigating and working with AWS Support to remediate as quickly as possible. This event appears to be impacting numerous AWS customers running Windows O/S and CrowdStrike.
- identified Jul 19, 2024, 07:14 AM UTC
CrowdStrike Engineering has identified a content deployment related to this issue and reverted those changes.
- identified Jul 19, 2024, 02:03 PM UTC
A global outage at our service provider, CrowdStrike, has made many Abrigo applications unavailable. Abrigo is in the process of remediating the issues. CrowdStrike has provided a resolution, but implementation requires multiple steps per impacted machine. We will continue to update this page, and we recommend subscribing to receive notifications as updates occur.
- identified Jul 19, 2024, 03:00 PM UTC
Abrigo continues to implement the necessary remediation steps to bring systems online. We are working through automating the process so progress will accelerate moving forward. We will update this page within 2 hours, and we recommend subscribing to receive notifications as updates occur.
- identified Jul 19, 2024, 04:55 PM UTC
Sageworks Analyst has been restored to operation.
- identified Jul 19, 2024, 05:00 PM UTC
Abrigo continues to implement the necessary remediation steps to bring systems online and has made significant progress. Some systems have been restored and we are closely testing and monitoring performance to ensure full operation. Our engineers continue to work through the remaining systems in order to restore as quickly as possible. Your patience during this process is appreciated. We will update this page within 2 hours, and we recommend subscribing to receive notifications as updates occur.
- identified Jul 19, 2024, 05:24 PM UTC
Abrigo ID has been restored to operation.
- identified Jul 19, 2024, 05:32 PM UTC
Abrigo ID is now operational. We will continue to update this incident as products come back online.
- identified Jul 19, 2024, 05:47 PM UTC
Online Portal is now operational, but it is not yet running at full capacity and may still experience some slowness. Full functionality is available.
- identified Jul 19, 2024, 07:03 PM UTC
We are continuing to work on a fix for this issue.
- identified Jul 19, 2024, 07:17 PM UTC
fileservice.abrigo.com is functioning as expected. Abrigo continues to implement the necessary remediation steps to bring systems online and has made significant progress. Some systems have been restored, and we are closely testing and monitoring performance to ensure full operation. Our engineers continue to work through the remaining systems as quickly as possible.
- identified Jul 19, 2024, 09:36 PM UTC
IQ AutoScan has been restored to operation. Sageworks API has been restored to operation. BAM+ is currently in a partial outage. Direct File and IQAS are currently degraded. We are continuing to make significant progress on the BAM+ Suite and have completed the majority of recovery tasks, with work continuing to ensure the availability of all functionality.
- identified Jul 19, 2024, 09:36 PM UTC
IQ AutoScan is operational.
- identified Jul 19, 2024, 10:02 PM UTC
LoanLoss Analyzer and VuluCast are operational.
- identified Jul 19, 2024, 10:31 PM UTC
FinCEN DirectFile is now operational. Sageworks API is now operational. BAM+ is currently degraded. We continue to work toward full resolution of this issue.
- identified Jul 19, 2024, 11:07 PM UTC
All systems are performing as expected.
- resolved Jul 19, 2024, 11:08 PM UTC
This incident has been resolved.
- postmortem Jul 30, 2024, 05:32 PM UTC
# CrowdStrike Outage

## Incident Overview

**Incident Commenced:** 19JULY2024

**Product Family Affected: All Products**

**Abrigo Reference#:** IM-125

**Incident Summary:** Beginning at 1:30 am ET on July 19, Abrigo-hosted software applications became unavailable due to a global outage caused by a faulty update from our service provider, CrowdStrike. This update led to issues with the Falcon sensor software, affecting our systems. Once remediation steps were provided by CrowdStrike, Abrigo promptly tested and implemented the fix, restoring functionality to our applications.

## Incident Timeline

## Remediation Summary and Current Status:

**Resolution Summary:** CrowdStrike provided resolution steps, which included either rebooting the affected machines up to 15 times or booting into safe mode to delete any .sys file beginning with c-00000291. Microsoft suggested an alternative resolution of restoring systems to backups from prior days. Abrigo implemented all these solutions in order of complexity. If repeated reboots did not resolve the issue, we either restored the system or deleted the identified .sys files. We prioritized production environments and subsequently addressed UAT environments over the weekend.

## Root Cause Analysis

**Root Cause Statement:** On July 19 at 04:09 UTC, CrowdStrike released an update to the Falcon sensor software on Windows PCs and servers that contained a faulty configuration. The update included a change to the configuration file responsible for monitoring named pipes, specifically Channel File 291. This modification led to an out-of-bounds memory read in the Windows sensor client, triggering an invalid page fault. Consequently, affected machines either experienced continuous reboot cycles or entered recovery mode.

**Mitigation Summary**: To mitigate the issue going forward, Abrigo will update our break glass process to ensure rapid and efficient responses. We will audit our backup settings in AWS to enhance reliability and ensure quick recovery. Additionally, we will review and refine our recovery precedence procedures to handle multiple production failures more effectively. These measures will strengthen our resilience and maintain our critical operations.

## Remediation Steps
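
The remediation described in the Resolution Summary can be sketched roughly as follows. This is an illustration of the logic only, not Abrigo's or CrowdStrike's actual tooling. The driver directory is an assumption taken from CrowdStrike's public guidance rather than from this postmortem, and the function names and the `dry_run` flag are hypothetical. In practice the file deletion was performed from safe mode or a recovery environment, where a Python interpreter is unlikely to be available.

```python
from pathlib import Path

# Assumption: directory from CrowdStrike's public guidance, not stated in this postmortem.
DEFAULT_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
CHANNEL_FILE_PREFIX = "c-00000291"  # prefix quoted in the Resolution Summary
MAX_REBOOTS = 15                    # "up to 15 times" per the Resolution Summary


def find_faulty_channel_files(driver_dir: Path = DEFAULT_DRIVER_DIR) -> list[Path]:
    """Return .sys files whose name starts with the faulty prefix (case-insensitive)."""
    if not driver_dir.is_dir():
        return []
    return sorted(
        p for p in driver_dir.iterdir()
        if p.is_file()
        and p.name.lower().startswith(CHANNEL_FILE_PREFIX)
        and p.suffix.lower() == ".sys"
    )


def remove_faulty_channel_files(driver_dir: Path = DEFAULT_DRIVER_DIR,
                                dry_run: bool = True) -> list[Path]:
    """List matching files; delete them only when dry_run is False."""
    matches = find_faulty_channel_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


def next_step(reboots_tried: int, healthy: bool) -> str:
    """Hypothetical escalation ladder: reboot first, then the more complex options."""
    if healthy:
        return "done"
    if reboots_tried < MAX_REBOOTS:
        return "reboot"
    return "delete_channel_file_or_restore_backup"
```

Defaulting to a dry run reflects a cautious design choice for a script that deletes driver files: an operator can review the matched paths before anything is removed.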