Box incident

[Critical] Issues with Multiple Box Services


Box experienced a critical incident on May 7, 2025 affecting Login/SSO, Uploads/Downloads, and Box Drive, lasting 5h 48m. The incident has been resolved; the full update timeline is below.

Started
May 07, 2025, 01:30 PM UTC
Resolved
May 07, 2025, 07:19 PM UTC
Duration
5h 48m
Detected by Pingoru
May 07, 2025, 01:30 PM UTC

Affected components

Login/SSO, Uploads/Downloads, Box Drive

Update timeline

  1. investigating May 07, 2025, 01:30 PM UTC

    We are investigating an ongoing issue affecting multiple Box services. We will provide more information as soon as it is available.

  2. investigating May 07, 2025, 01:44 PM UTC

    We are continuing to investigate this issue.

  3. monitoring May 07, 2025, 01:55 PM UTC

    Our team has taken steps to remediate this issue and is seeing improvement for multiple services. We are continuing to monitor for any additional impact.

  4. investigating May 07, 2025, 02:28 PM UTC

    After continued monitoring of the issue, we have determined that further action is necessary. We are moving to an active investigation to identify the root cause. We will provide more information as soon as it is available.

  5. investigating May 07, 2025, 02:31 PM UTC

    We are continuing to investigate this issue.

  6. investigating May 07, 2025, 02:45 PM UTC

    We are continuing to investigate this issue. We will provide more information as soon as it is available.

  7. investigating May 07, 2025, 03:14 PM UTC

    We are continuing to investigate this issue and will provide updates as we have them.

  8. investigating May 07, 2025, 03:57 PM UTC

    We are continuing to investigate this issue and will provide updates as we have them.

  9. investigating May 07, 2025, 05:13 PM UTC

    We are continuing to investigate this issue and will provide updates as we have them.

  10. identified May 07, 2025, 06:17 PM UTC

    Our team has identified the underlying cause of this issue and is working to take remediating steps. We will provide additional updates as they become available.

  11. monitoring May 07, 2025, 06:20 PM UTC

    Our team has taken steps to remediate this issue and is seeing improvement for the impacted services. We are continuing to monitor for any additional impact.

  12. resolved May 07, 2025, 07:19 PM UTC

    After further monitoring, this incident is now considered resolved. All services have been restored to full functionality. If you continue to experience any issues, please contact Box Support at https://support.box.com.

  13. postmortem May 08, 2025, 04:35 PM UTC

    We recently addressed issues affecting multiple Box services. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

    On May 7, 2025, between 01:21 AM and 01:00 PM PT, some users may have experienced difficulties while working in Box, particularly when interacting with Box services. The issue occurred due to a code change that inadvertently increased the memory requirements in our database backend for certain queries. This triggered automated remediation tooling; however, the tooling was unable to keep up with the consistently increasing load. This resulted in degraded availability for some of our services that perform write operations for a small subset of users. We were able to resolve the issue by rolling back the change once it was identified. We have implemented enhanced monitoring to quickly catch and mitigate similar issues in the future.

    **Analysis**

    The Search Team is actively working to reduce overall quicksearch latency for customers. Quicksearch is a fast search request executed in the Box Web Application by typing a query term (without pressing Enter to continue to FullSearch). As part of these efforts, the team implemented validation logic that inadvertently issued a large number of parallel queries against our databases, causing them to experience memory exhaustion. This issue was not caught in testing due to differences in our production data compared to test data.

    **Corrective Actions**

    Box has initiated the following corrective actions:

    * Creating test datasets that reflect production data distribution.
    * Enforcing configuration of bounded execution pools for asynchronous processing, preventing future uncontrolled surges and improving system resilience.
    * Conducting a round of rollback exercises using standardized pipelines while ensuring rollback links are centrally accessible and reliable.
    * Improving logging and monitoring on database nodes and critical dependency traffic, to enable earlier detection of abnormal load patterns.

    We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter.

    Sincerely,
    The Box Team
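
The postmortem's second corrective action, bounded execution pools for asynchronous work, can be illustrated with a minimal sketch. It is not Box's implementation: `run_bounded`, `FakeDatabase`, `demo`, and the limit of 4 are hypothetical names and values chosen for illustration. The idea is that a semaphore caps how many queries are in flight at once, so a large batch cannot turn into an unbounded surge against the database.

```python
import asyncio


async def run_bounded(queries, run_query, max_in_flight=4):
    """Run run_query over queries with at most max_in_flight active at once."""
    limit = asyncio.Semaphore(max_in_flight)

    async def guarded(query):
        # Each task waits here until one of the max_in_flight slots is free.
        async with limit:
            return await run_query(query)

    # gather returns results in the same order as the input queries.
    return await asyncio.gather(*(guarded(q) for q in queries))


class FakeDatabase:
    """Hypothetical stand-in for a database; records peak concurrency."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def query(self, term):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulate query latency
        self.in_flight -= 1
        return term.upper()


def demo(n_queries=50, max_in_flight=4):
    """Issue n_queries through the bounded runner and report peak concurrency."""
    db = FakeDatabase()
    terms = [f"term{i}" for i in range(n_queries)]
    results = asyncio.run(run_bounded(terms, db.query, max_in_flight))
    return results, db.peak
```

Calling `demo()` issues 50 simulated queries while the fake database never sees more than 4 running concurrently. Without the semaphore, all 50 would start at once, which is the kind of uncontrolled parallelism the Analysis section describes.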