Box experienced a critical incident on February 7, 2025 affecting Content API and Login/SSO, lasting 1h 22m. The incident has been resolved; the full update timeline is below.
Affected components: Content API, Login/SSO
Update timeline
- investigating Feb 07, 2025, 03:10 AM UTC
We are investigating an ongoing issue affecting multiple Box services. We will provide more information as soon as it is available.
- investigating Feb 07, 2025, 03:18 AM UTC
We are continuing to investigate this issue.
- investigating Feb 07, 2025, 03:31 AM UTC
We are continuing to investigate this issue.
- monitoring Feb 07, 2025, 04:13 AM UTC
A fix has been implemented and we are monitoring the results. All services were impacted by the Scheduled Resiliency Test Maintenance. Users may also have experienced the sidebar not loading in the Box web application; the sidebar was anticipated to be impacted, though this occurred outside of the test maintenance window.
- resolved Feb 07, 2025, 04:33 AM UTC
After further monitoring, this incident is now considered resolved. All services have been restored to full functionality. If you continue to experience any issues, please contact Box Support at https://support.box.com.
- postmortem Feb 24, 2025, 07:15 PM UTC
We recently addressed issues affecting Box services. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

On February 6, 2025, between 5:00 PM PST and 7:00 PM PST, Box conducted a planned zonal resilience test as part of ongoing efforts to enhance system reliability. During the test, all traffic from one active service zone was diverted to other healthy zones. As a result, some users may have experienced slowness, login failures, or other difficulties while using Box. The impact lasted until 7:04 PM PST, extending four minutes beyond the planned maintenance window. During the test, some backend services were temporarily overloaded, making the impact more severe than anticipated. We mitigated the issue by redistributing traffic across all active zones.

**Analysis**

During the planned zonal resilience test, all traffic from one active service zone was diverted to other zones. However, one of the receiving zones took on a disproportionate share of the diverted traffic. While the backend services in that zone generally had enough capacity to handle the load, a cache instance became overloaded because one key was accessed significantly more often than others. This overload triggered a ripple effect: it degraded database health, which in turn caused requests to pile up at the Edge layer and ultimately led to request rejections. While the test was still running, we identified the root cause and mitigated the issue by ending the test and redistributing traffic more evenly across all active zones.

The issue revealed the following areas for improvement:

* Insufficient monitoring for single-instance failures during resilience testing.
* Uneven traffic distribution among zones, leading to localized overload.
* Lack of mechanisms to detect heavily accessed keys and prevent them from overloading cache instances.
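
The postmortem does not describe how heavily accessed keys would be detected. As a rough illustration only, the sketch below flags any key whose share of sampled cache accesses exceeds a threshold. The function name `find_hot_keys`, the 20% threshold, and the sample key names are all hypothetical and are not taken from Box's systems.

```python
from collections import Counter


def find_hot_keys(accessed_keys, share_threshold=0.2):
    """Return {key: share} for keys whose share of all accesses exceeds the threshold.

    accessed_keys: iterable of cache keys, one entry per sampled access.
    share_threshold: hypothetical cutoff; 0.2 means "more than 20% of accesses".
    """
    counts = Counter(accessed_keys)
    total = sum(counts.values())
    if total == 0:
        # No accesses sampled, so nothing can be hot.
        return {}
    return {
        key: count / total
        for key, count in counts.items()
        if count / total > share_threshold
    }


# Hypothetical sample: one key receives 70 of 100 sampled accesses.
sample = ["config:global"] * 70 + ["user:1"] * 10 + ["user:2"] * 10 + ["user:3"] * 10
hot = find_hot_keys(sample)
print(hot)
```

In practice, a check like this would run over a sliding window of recent accesses per cache instance, so that a single dominant key is surfaced before the instance saturates.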
**Corrective Actions**

Box has initiated the following corrective actions:

* Enhancing zonal resilience mechanics to ensure more balanced traffic distribution during a zonal failure.
* Improving monitoring to detect single-instance failures more effectively during resilience testing.
* Optimizing cache access patterns to prevent heavily accessed keys from overloading cache instances.

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here, and we would be happy to answer any questions you may still have regarding this matter.

Sincerely,
The Box Team