[Outage] Box Services Unavailable

Severity: Critical · Status: Resolved

Box experienced a critical incident on May 21, 2025 affecting Login/SSO, lasting 1h 42m. The incident has been resolved; the full update timeline is below.

Started: May 21, 2025, 09:28 PM UTC
Resolved: May 21, 2025, 11:10 PM UTC
Duration: 1h 42m
Detected by Pingoru: May 21, 2025, 09:28 PM UTC

Affected components

Login/SSO

Update timeline

  1. Investigating (May 21, 2025, 09:28 PM UTC)

    Our team is investigating an issue with multiple Box services. We will provide additional information as it becomes available.

  2. Investigating (May 21, 2025, 09:35 PM UTC)

    We are continuing to investigate this issue.

  3. Identified (May 21, 2025, 09:59 PM UTC)

    Our team has identified the underlying cause of this issue and is working to take remediating steps. We will provide additional updates as they become available.

  4. Monitoring (May 21, 2025, 10:07 PM UTC)

    Our team has taken steps to remediate this issue and is seeing recovery in all metrics. We are continuing to monitor for any additional impact.

  5. Resolved (May 21, 2025, 11:10 PM UTC)

    After further monitoring, this incident is now considered resolved. If you continue to experience any issues, please contact Box Support at https://support.box.com.

  6. Postmortem (May 22, 2025, 04:26 PM UTC)

    We recently addressed issues affecting Box and would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future. Between 02:18 PM PT and 02:49 PM PT on May 21, 2025, users may have experienced difficulties while working in Box. During this time, Box services and APIs were temporarily unavailable. This was caused by a middleware service that was unable to discover our databases due to a change made in the tooling used to manage our database fleet. We were able to resolve the issue by rolling back the problematic change. In addition, we are improving our database management tool’s observability and rollout process to proactively detect and prevent similar issues from occurring in the future.

    **Analysis**

    As part of our routine patching process, we upgraded the tooling used to manage our database fleet. This upgrade was performed before certain prerequisite upgrades on a subset of critical databases. As a result, the updated tooling was incompatible with those databases, causing our middleware service to fail to discover them, leading to a service outage. The incident highlighted gaps in our database management tool’s upgrade sequencing, rollout validation, and rollback readiness, as well as gaps in the system observability needed for early detection and remediation of partial discovery failures, all areas we are actively addressing.

    **Corrective Actions**

    Box has initiated the following corrective actions:

    * Improve patching upgrade rollout validation
    * Optimize our database management tool’s rollback procedures for faster recovery
    * Enhance system observability for early detection and remediation of partial discovery failures

    We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here, and we would be happy to answer any questions you may still have regarding this matter.
Sincerely, The Box Team
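
Editor's note: the failure mode described in the postmortem (a fleet-management tool upgraded before prerequisite upgrades had reached every database, leaving some hosts undiscoverable) can be illustrated with a short sketch of the "rollout validation" corrective action. This is a hypothetical example; none of the names, versions, or functions below come from Box's actual systems.

```python
from dataclasses import dataclass

# Assumed minimum prerequisite version the new tooling can discover.
REQUIRED_PREREQ_VERSION = (2, 0)

@dataclass
class Database:
    host: str
    prereq_version: tuple  # version of the prerequisite component on this host

def validate_rollout(fleet: list[Database]) -> list[str]:
    """Return hosts that would become undiscoverable if the tooling
    upgrade proceeded now. An empty list means the rollout is safe."""
    return [db.host for db in fleet
            if db.prereq_version < REQUIRED_PREREQ_VERSION]

fleet = [
    Database("db-01", (2, 1)),
    Database("db-02", (1, 9)),  # prerequisite upgrade not applied yet
    Database("db-03", (2, 0)),
]

blockers = validate_rollout(fleet)
if blockers:
    print(f"blocking rollout; lagging hosts: {blockers}")
else:
    print("rollout validated")
```

Gating the upgrade on a fleet-wide prerequisite check like this turns the incompatibility into a pre-rollout failure instead of a production discovery outage.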