Brillium incident

AWS Service Disruption Affecting Brillium Services

Brillium experienced a major incident on October 20, 2025, affecting API, User Administration and Authentication, and other components. It lasted 3h 2m. The incident has been resolved; the full update timeline is below.

Started: Oct 20, 2025, 07:30 AM UTC
Resolved: Oct 20, 2025, 10:32 AM UTC
Duration: 3h 2m
Detected by Pingoru: Oct 20, 2025, 07:30 AM UTC
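
The reported duration follows directly from the two UTC timestamps above. A minimal Python check, using only the values listed here (the variable names are illustrative):

```python
from datetime import datetime, timezone

# Timestamps as listed above, both in UTC.
started = datetime(2025, 10, 20, 7, 30, tzinfo=timezone.utc)
resolved = datetime(2025, 10, 20, 10, 32, tzinfo=timezone.utc)

# Split the elapsed time into whole hours and leftover minutes.
elapsed = resolved - started
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes = remainder // 60

duration = f"{hours}h {minutes}m"
print(duration)  # 3h 2m
```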

Affected components

API
User Administration and Authentication
Assessment Authoring
Invitation Management
Partner Central Custom Administration
Zapier Integration
Assessment Delivery
Recruiter & Candidate Management

Update timeline

monitoring Oct 20, 2025, 09:55 AM UTC

Time of Initial Impact: Approximately 3:00 AM EST

We are writing to inform you that Brillium services are currently experiencing an impact due to a widespread Amazon Web Services (AWS) regional outage that began around 3:00 AM EST this morning. This AWS disruption is affecting the availability and performance of various Brillium services. Our engineering and operations teams were immediately alerted and are working diligently to assess the full scope of the impact on our infrastructure.

What We Are Doing:
• We are in close and continuous contact with AWS to gather the latest updates on their progress.
• Our teams are actively exploring and implementing potential mitigating steps where possible.
• We are preparing for a swift return to full operational status once the underlying AWS issue is resolved.

We understand the criticality of our services to your operations and sincerely apologize for the inconvenience and disruption this unplanned outage is causing. We are committed to providing you with regular updates and will post a new notification as soon as we have substantive information from AWS or when the incident is resolved. Thank you for your patience and understanding as we navigate this external issue.

Next Update Expected: 15 minutes

monitoring Oct 20, 2025, 10:10 AM UTC

Current Status: Resolved / Monitoring
Time of Resolution: Approximately 6:00 AM EST

We are pleased to report that the widespread Amazon Web Services (AWS) regional issue that began around 3:00 AM EST appears to have been resolved by AWS. We are now seeing a steady return to normal operations and full service availability across all affected Brillium services.

Next Steps & What We Are Doing Now:
• Our engineering team has confirmed that all core services and customer-facing features are back online and operational.
• We are now in an extensive monitoring and stabilization phase. We will continue to closely watch system performance and metrics over the next several hours to ensure complete stability and prevent any potential lingering effects.

We know this outage caused significant disruption, and we sincerely appreciate your patience and understanding throughout this external incident. If you continue to experience any unusual issues with a specific Brillium service, please do not hesitate to reach out to our support team. We will provide one final wrap-up report once we have completed the monitoring phase and are 100% confident in the full, sustained restoration of service.

resolved Oct 20, 2025, 10:32 AM UTC

Final Resolution: Full Service Restoration and Normal Operation
Current Status: Resolved
Time of Final Confirmation: Approximately 6:30 AM EST

We are happy to confirm the full and sustained resolution of the earlier service disruption caused by the Amazon Web Services (AWS) regional outage. Our intensive monitoring period has concluded, and we have verified that:
• All Brillium services are fully operational.
• All systems are running within normal performance parameters.
• There are no residual effects or lingering issues from the external AWS incident.

We consider this incident closed. Thank you once again for your patience and understanding during this unplanned disruption. We appreciate your reliance on Brillium services and remain committed to providing you with reliable performance.

postmortem Oct 21, 2025, 01:10 AM UTC

# **AWS Regional Outage (October 20, 2025)**

## **1. Executive Summary**

On October 20, 2025, Brillium services experienced a significant disruption lasting approximately 4.5 hours due to an external, widespread regional outage within Amazon Web Services (AWS). The incident began at 3:00 AM EST and primarily impacted service availability and performance for customers relying on the affected AWS region.

The core issue was external to Brillium’s platform. Our focus during the incident was on confirmation, communication, and swift restoration, which was completed by 7:30 AM EST after AWS reported their upstream resolution.

## **2. Key Details**

| Metric | Detail |
| --- | --- |
| **Incident Name** | AWS Regional Service Disruption |
| **Date** | October 20, 2025 |
| **Duration** | 4 hours, 30 minutes (3:00 AM EST to 7:30 AM EST) |
| **Impacted Services** | All core Brillium services (including API, Data Processing, and Web Frontend) |
| **Root Cause** | Widespread regional outage in AWS (External) |
| **Resolution Status** | Fully Resolved |

## **3. Impact Analysis**

During the incident window (3:00 AM - 6:00 AM EST), customers experienced:

* **Service Unavailability:** Difficulty accessing or connecting to various Brillium applications.
* **Performance Degradation:** Increased latency and intermittent timeouts when services were partially available.
* **Data Processing Delays:** Backend processing queues were backed up, leading to delays in scheduled tasks and data updates.

The primary customer impact was loss of service availability for the duration of the upstream AWS outage.

## **4. Root Cause**

The root cause was confirmed to be a major service disruption impacting a critical AWS region upon which a portion of Brillium’s infrastructure relies. This was an external failure of the cloud provider’s infrastructure.

* **Brillium Action:** The incident was immediately confirmed via AWS status pages and internal monitoring systems.
* **External Cause:** An initial AWS failure (e.g., networking or power event) cascaded across availability zones within the region.

## **5. Incident Timeline (All Times EST)**

| Time | Event |
| --- | --- |
| **3:00 AM** | Internal monitoring alerts triggered across multiple Brillium services. Incident declared. |
| **3:15 AM** | External AWS status page confirms a major regional incident affecting multiple services. |
| **3:30 AM** | Initial customer status update posted to [status.brilium.com](http://status.brilium.com) identifying the external AWS issue. |
| **6:00 AM** | AWS reports resolution of the underlying issue, and Brillium systems begin self-recovering. |
| **6:15 AM** | Service restoration update posted; Brillium enters extensive monitoring phase. |
| **6:30 AM** | Brillium monitoring confirms all services are stable, running within normal parameters, and fully functional. Final resolution update posted. |

## **6. Corrective Actions and Lessons Learned**

While the root cause was external, we identified opportunities to improve our monitoring and response to similar external events:

| Area | Action Item | Target Date |
| --- | --- | --- |
| **Alerting** | Enhance specific alerting thresholds to differentiate between high load/internal issues and sudden, widespread external availability failures. | End of Q2 2026 |
| **Communication** | Create pre-drafted status page templates for common external dependency failures (e.g., AWS, other third-party providers) to expedite initial communication. | Immediate |
| **Monitoring** | Implement synthetic transactions (probes) in a secondary, unaffected region to quickly confirm global service health during local regional outages. | Q4 2025 |

We appreciate the patience of our customers during this disruption and are committed to implementing these actions to enhance the resilience of the Brillium platform.
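
The Alerting and Monitoring action items in the corrective actions table could be combined roughly as follows. This is a minimal sketch under stated assumptions, not Brillium's implementation: the `ProbeResult` type, the `classify` function, the service names, and both thresholds are hypothetical and chosen only for illustration. It assumes synthetic probes run from outside the affected region and report, per service, whether the transaction succeeded and how long it took.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProbeResult:
    """One synthetic transaction result (hypothetical shape)."""
    service: str       # illustrative service label, e.g. "api"
    ok: bool           # True if the synthetic transaction succeeded
    latency_ms: float  # round-trip time; only meaningful when ok is True


def classify(results: Sequence[ProbeResult],
             widespread_ratio: float = 0.8,  # assumed threshold
             slow_ms: float = 2000.0) -> str:  # assumed threshold
    """Separate sudden, widespread failures from load or internal issues."""
    if not results:
        raise ValueError("at least one probe result is required")

    failed = [r for r in results if not r.ok]

    # Most services failing at once suggests an upstream provider outage.
    if len(failed) / len(results) >= widespread_ratio:
        return "suspected_external_outage"

    # Isolated failures or slow responses look like load or an internal fault.
    if failed or any(r.ok and r.latency_ms > slow_ms for r in results):
        return "internal_or_load_issue"

    return "healthy"


# Usage: every probed service fails in the same probe cycle.
sample = [
    ProbeResult("api", False, 0.0),
    ProbeResult("auth", False, 0.0),
    ProbeResult("delivery", False, 0.0),
]
print(classify(sample))  # suspected_external_outage
```

A single ratio over one probe cycle is deliberately simple. A production rule would likely also require the failures to persist across several consecutive cycles before paging anyone.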