Box incident

[Critical] Issues with Uploads, Downloads and Logins

Box experienced a critical incident on January 16, 2025 affecting Login/SSO and Uploads/Downloads, lasting 49m. The incident has been resolved; the full update timeline is below.

Started: Jan 16, 2025, 01:12 AM UTC
Resolved: Jan 16, 2025, 02:02 AM UTC
Duration: 49m
Detected by Pingoru: Jan 16, 2025, 01:12 AM UTC

Affected components

Login/SSO, Uploads/Downloads

Update timeline

investigating Jan 16, 2025, 01:12 AM UTC

We are investigating an ongoing issue affecting uploads, downloads and logins. We will provide more information as soon as it is available.

monitoring Jan 16, 2025, 01:23 AM UTC

A fix has been implemented and services have recovered. We are monitoring the results.

resolved Jan 16, 2025, 02:02 AM UTC

After further monitoring, this incident is now considered resolved. Uploads, downloads, and logins have been restored to full functionality. If you continue to experience any issues, please contact Box Support at https://support.box.com.

postmortem Feb 07, 2025, 02:15 AM UTC

We recently addressed issues affecting Box Logins and Webapp. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 3:52pm PST and 5:10pm PST on January 15th, 2025, some users may have experienced difficulties while working in Box. During this time, some of the load balancers taking customer traffic suffered from memory exhaustion, leading to users experiencing intermittent issues logging in to Box and connecting to the Box Webapp.

The issue occurred due to a latent memory exhaustion problem in some of our public load balancer instances and was exacerbated by peak traffic levels. We were able to resolve the issue by performing a rolling restart of the affected instances and increasing the number of available load balancer instances to support peak traffic levels. In addition, we are working on improving our observability into the latent memory issues and previously unknown signals on these systems to prevent similar issues from occurring in the future.

**Analysis**

Starting at 5:30am PST on January 15th, some external load balancer instances, which are responsible for routing all customer traffic, started to exhaust their shared memory allocation. This happened due to organic traffic growth and an increase in the number of backends to which the load balancers were proxying traffic. While the overall memory of these systems remained at an acceptable level throughout the incident window, the shared memory zone was not tracked as a separate metric; as a result, the team was not alerted to this resource exhaustion. As the day went on and traffic levels increased, at 3:52pm PST these instances started to experience intermittent problems passing traffic to their backends, which led to customers experiencing intermittent errors or slowdowns when accessing Box (including logins as well as the Webapp).
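The monitoring gap described above, where overall host memory looked healthy while a shared memory zone filled up, can be illustrated with a minimal sketch. This is not Box's tooling: the `ZoneSample` shape, the function names, the instance and zone names, and the 80% threshold are all assumptions made for illustration. The point is only that utilization is computed per zone rather than per host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneSample:
    """One reading of a load balancer shared memory zone (hypothetical shape)."""
    instance: str    # load balancer instance name (illustrative)
    zone: str        # shared memory zone name (illustrative)
    used_bytes: int  # bytes currently allocated inside the zone
    size_bytes: int  # configured size of the zone


def zone_utilization(sample: ZoneSample) -> float:
    """Return the fraction of the zone in use, from 0.0 to 1.0."""
    if sample.size_bytes <= 0:
        raise ValueError("zone size must be positive")
    return sample.used_bytes / sample.size_bytes


def zones_over_threshold(samples, threshold=0.8):
    """Return (instance, zone, utilization) for each zone at or above threshold.

    The 0.8 default is an assumed value, not one taken from the postmortem.
    """
    alerts = []
    for sample in samples:
        utilization = zone_utilization(sample)
        if utilization >= threshold:
            alerts.append((sample.instance, sample.zone, utilization))
    return alerts


# Two instances with the same zone size; only lb-1 is close to exhaustion.
samples = [
    ZoneSample("lb-1", "upstreams", used_bytes=60 * 1024, size_bytes=64 * 1024),
    ZoneSample("lb-2", "upstreams", used_bytes=20 * 1024, size_bytes=64 * 1024),
]
print(zones_over_threshold(samples))
```

A check of this kind would flag `lb-1` even if the host as a whole reported plenty of free memory, which is the signal the postmortem says was missing.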
Once the problem was identified during the investigation, we performed a rolling restart of the affected load balancer instances and increased the number of available instances. As a result of these efforts, overall site health improved immediately and was considered recovered at 5:10pm PST.

**Corrective Actions**

Box has initiated the following corrective actions:

* Improving our tracking and alerting on shared memory zone utilization [DONE]
* Improving external test coverage as well as internal SLO baselines
* Automating the process for rolling reloads or restarts across the load balancing fleet

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter.

Sincerely,
The Box Team