Box incident
[Major] Customers may experience issues with Login and All Files page
Box experienced a major incident on March 19, 2025, affecting Login/SSO and lasting 27 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Mar 19, 2025, 10:58 AM UTC
Our team is investigating an issue with the login page, authentication, and the All Files page. Users may see errors or slowness when attempting to access Box. We will provide additional information as it becomes available.
- investigating Mar 19, 2025, 11:12 AM UTC
We are continuing to investigate this issue.
- monitoring Mar 19, 2025, 11:23 AM UTC
Our team is still monitoring the issue and is seeing improvement in authentication and in access to the All Files page. We are continuing to monitor for any additional impact.
- resolved Mar 19, 2025, 11:26 AM UTC
After further monitoring, this incident is now considered resolved. Our Service has been restored to full functionality. If you continue to experience any issues, please contact Box Support at https://support.box.com.
- postmortem Mar 27, 2025, 03:35 PM UTC
We recently addressed issues affecting several features across Box, including Logins, Uploads/Downloads, and Notes. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 3:06 AM and 4:35 AM PDT on March 19, 2025, some users may have experienced slowness or occasional errors when interacting with several major features in the Box platform. The incident was caused by a latent issue that inadvertently caused several caching instances to restart in an unhealthy state while serving traffic. We resolved it by temporarily routing traffic away from the impacted caching instances until they were manually restored by on-call engineers. In addition, we addressed the underlying code issue that triggered the server restarts and will provide additional controls to prevent similar issues from occurring in the future.

**Analysis**

At 3:05 AM PDT, the service responsible for managing our caching clusters became unreachable due to a latent code bug in its leader election process. Health checks on individual caching VMs misinterpreted the loss of connectivity to this central service as a problem with the caching instances themselves. The health checks therefore triggered restarts intended to remediate the problem; instead, the restarted instances came up in an unhealthy state and rejected live traffic. The upstream data access service, which depends on these caches, was in turn forced to route additional traffic to our databases, resulting in increased latency and potential timeouts.

In this case, the impacted caching instances were contained to a single GCP availability zone. However, this went unnoticed during the initial investigation because engineers focused on remediating the problems within the caching tier, which increased time to remediation. We eventually remediated the issue by using newly developed tooling to divert all Box traffic from the degraded zone until the affected caching instances were manually restored by on-call engineers.

**Corrective Actions**

Box has initiated the following corrective actions:

* Changing the underlying behavior that led to improperly triggering VM recreation and adding additional controls to ensure successful initialization.
* Implementing changes to ensure that leader election for this management service guarantees that a healthy server is discoverable.
* Adding observability to diagnose failures contained to a single AZ and improving processes to more quickly employ a zonal drain.

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here, and we would be happy to answer any questions you may still have regarding this matter.

Sincerely,
The Box Team