Box incident

[Minor] Customers may have experienced issues using All Files page and Public API

Box experienced a minor incident on March 4, 2025. The incident has been resolved; the full update timeline is below.

Started: Mar 04, 2025, 04:00 AM UTC
Resolved: Mar 04, 2025, 04:00 AM UTC
Duration: —
Detected by Pingoru: Mar 04, 2025, 04:00 AM UTC

Update timeline

resolved Mar 04, 2025, 05:50 AM UTC

Between 8:44 PM and 8:56 PM PT on March 3, 2025, some users may have experienced difficulties using the All Files page and the Public API. No further impact has been observed, and we consider this issue resolved. If you are still experiencing any issues, please let us know at https://support.box.com.

postmortem Apr 24, 2025, 12:51 PM UTC

We recently addressed issues affecting Box services. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 1:48 AM PT and 2:25 AM PT on March 3, 2025, some users may have experienced difficulties while working in Box. Additionally, starting at 8:44 PM PT that same day, some users may have once again encountered issues. The disruption ended before 9:59 PM PT. During these time periods, a subset of users experienced slowness and intermittent errors with Notes, Public API, logins, and uploads/downloads.

The issue occurred as a result of a fragmented system table on a database cluster, which ultimately led to the database crashing. The first instance was caused by increased traffic, while the second occurred because our manual remediation process put additional load on the database. Our database remediation service attempted to resolve the issue both times but was unsuccessful because the thread_cache_size setting was too low. We addressed the short-term problem by manually redirecting traffic to a healthy database node. To maintain medium-term stability of the database, the team rebuilt the cluster to eliminate the fragmented table. Additionally, we will split the database cluster into smaller databases to prevent future overloads, and we will improve our database remediation service to better handle this type of case.

**Analysis**

The database cluster at issue experienced gradual performance degradation before the issue became apparent. This degradation was caused by growth of the fragmented system table as database size and traffic increased. However, the degradation went unnoticed because the existing alerting system did not flag any problems. In addition, the auto-remediation system was unsuccessful because it hit a case where two database configurations were incompatible. Specifically, the max_connections setting was increased without adjusting thread_cache_size, resulting in frequent thread cache misses and leaving the failover procedure without the resources it needed to succeed.

**Corrective Actions**

Box has initiated the following corrective actions:

* Rebuilding the database cluster to eliminate the table fragmentation and prevent medium-term performance degradation
* Adding metrics and alerting for table fragmentation to proactively monitor issues
* Adjusting database configurations such that thread_cache_size dynamically adjusts with the max_connections database configuration setting
* Improving the database remediation process by adjusting timeouts to accommodate large database clusters
* Accelerating the database split process to quickly divide large clusters, reducing traffic overload and improving routine maintenance success

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter.

Sincerely,
The Box Team
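
Two of the corrective actions above, keeping thread_cache_size in step with max_connections and alerting on table fragmentation, can be illustrated with a small Python sketch. This is not Box's implementation. It assumes a MySQL-compatible database (the postmortem does not name the engine), where both settings exist and where `information_schema.TABLES` exposes `DATA_LENGTH`, `INDEX_LENGTH`, and `DATA_FREE`. The sizing defaults mirror the autosizing rule in the MySQL reference manual (8 + max_connections / 100, capped at 100). The function names, the fragmentation heuristic, and the 0.3 alert threshold are hypothetical.

```python
def suggested_thread_cache_size(max_connections: int, floor: int = 8,
                                cap: int = 100) -> int:
    """Derive a thread cache size that scales with max_connections.

    Defaults follow MySQL's documented autosizing rule; a large cluster
    may well need a higher cap, so treat them as placeholders.
    """
    if max_connections < 1:
        raise ValueError("max_connections must be positive")
    return min(cap, floor + max_connections // 100)


def thread_cache_is_undersized(max_connections: int,
                               thread_cache_size: int) -> bool:
    """Flag a cache that was not raised along with max_connections."""
    return thread_cache_size < suggested_thread_cache_size(max_connections)


def fragmentation_ratio(data_length: int, index_length: int,
                        data_free: int) -> float:
    """Hypothetical heuristic: share of a table's allocation that is free.

    Inputs correspond to DATA_LENGTH, INDEX_LENGTH, and DATA_FREE (bytes).
    """
    total = data_length + index_length + data_free
    return data_free / total if total else 0.0


def should_alert_on_fragmentation(ratio: float,
                                  threshold: float = 0.3) -> bool:
    """Alert once free space exceeds the (assumed) threshold."""
    return ratio > threshold
```

For example, `thread_cache_is_undersized(4000, 9)` reports a mismatch of the kind described in the Analysis: the connection limit was raised while the cache stayed near its value for a much smaller limit. Deriving one setting from the other in code, rather than setting both by hand, is what keeps them from drifting apart.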