Boltrics incident
Performance degradation in Business Central for some of our customers
Boltrics experienced a major incident on June 11, 2025, lasting 11h 45m. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Jun 16, 2025, 10:46 AM UTC
Dear customer,

First of all, we would like to sincerely apologize once again for the inconvenience caused by the service outage on June 11th. We fully understand the significant impact this has had on your operations and greatly appreciate your patience and understanding.

Microsoft has shared a detailed Root Cause Analysis, confirming that the issue resulted from capacity limitations within one of their hosting clusters (AS4556). While the root cause lies within Microsoft's infrastructure, we take the consequences for our customers very seriously. We will continue our discussions with Microsoft to explore how we can help prevent similar issues in the future and improve both response and resolution times. Our goal remains to ensure the stability and performance of the 3PL Dynamics solution for all our customers.

### Root Cause Analysis (RCA) from Microsoft

### Summary

Based on our investigation and performance data analysis, the service outage that primarily affected Dynamics 365 Business Central customers in the Netherlands region was caused by cluster resource exhaustion. This resulted from reaching maximum node capacity in the default host group (AS4556, the main affected cluster), leading to persistent high CPU utilization and insufficient memory allocation across the cluster nodes.

### Technical Root Cause

**Primary Issue: Cluster Resource Saturation**

The outage was fundamentally caused by the cluster reaching its maximum operational capacity. Our analysis confirmed that the majority of affected tenants were hosted on this specific cluster infrastructure. The cluster had exceeded its designed tenant-to-node ratio, creating a cascading performance degradation scenario.

**Resource Utilization Patterns**

The performance metrics from the outage demonstrate the resource exhaustion pattern:

- CPU Utilization: Nodes consistently operated at maximum capacity (approaching 100% utilization), with multiple production instances showing sustained high processor load
- Memory Constraints: Available memory dropped below critical thresholds, forcing the system into resource contention scenarios
- Query Performance Impact: The number of long-running SQL queries increased significantly, with execution times extending well beyond normal operational parameters

**Infrastructure Scaling Limitations**

The cluster architecture had reached its horizontal scaling limits within the default host group configuration. When cluster systems approach maximum node capacity, several performance issues emerge simultaneously:

1. Resource Competition: Multiple tenants competing for limited CPU and memory resources
2. Memory Pressure: Low memory conditions forcing increased CPU usage for memory management operations
3. Query Bottlenecks: Database operations experiencing delays due to resource contention

### Performance Impact Analysis

**Login and Authentication Issues**

The resource exhaustion directly impacted the authentication infrastructure, causing:

- Extended login response times
- Authentication service timeouts
- Session establishment failures

**Application Performance Degradation**

Users experienced significant performance issues, including:

- Unworkably slow response times across Business Central operations
- Extended query execution periods
- System responsiveness falling below acceptable service levels

### Resolution Strategy

**Immediate Remediation**

We implemented a cluster splitting solution to address the immediate capacity constraints (a simplified illustration follows the list):

1. Cluster Segmentation: The overloaded cluster was divided into two separate clusters
2. Tenant Redistribution: Existing tenants were redistributed across the newly created cluster infrastructure
3. Resource Rebalancing: The split reduced the tenant-to-node ratio, providing adequate CPU and memory resources per tenant
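To give a sense of the general idea behind such a redistribution, below is a minimal sketch of a greedy load-balancing split. This is our own simplified illustration, not Microsoft's actual tooling; the tenant names, load figures, and the `rebalance` helper are hypothetical.

```python
# Simplified sketch of greedy tenant redistribution across two clusters.
# Illustration only: this does not reflect Microsoft's implementation,
# and the tenant names and load figures below are hypothetical.

def rebalance(tenants: dict[str, float], cluster_count: int = 2) -> list[dict[str, float]]:
    """Greedily assign each tenant to the currently least-loaded cluster."""
    clusters: list[dict[str, float]] = [{} for _ in range(cluster_count)]
    loads = [0.0] * cluster_count
    # Placing the heaviest tenants first keeps the final split close to even.
    for name, load in sorted(tenants.items(), key=lambda kv: kv[1], reverse=True):
        target = loads.index(min(loads))
        clusters[target][name] = load
        loads[target] += load
    return clusters

if __name__ == "__main__":
    # Hypothetical per-tenant load scores (e.g., normalized CPU demand).
    tenants = {"tenant-a": 0.9, "tenant-b": 0.7, "tenant-c": 0.6,
               "tenant-d": 0.4, "tenant-e": 0.3, "tenant-f": 0.1}
    for i, cluster in enumerate(rebalance(tenants)):
        print(f"cluster {i}: {cluster} (total load {sum(cluster.values()):.1f})")
```

The point of the sketch is simply that lowering the tenant-to-node ratio on each resulting cluster relieves the resource contention described above.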
**Technical Benefits of Cluster Splitting**

The cluster division approach provided several immediate benefits:

- Reduced Resource Contention: Lower tenant density per cluster node eliminated resource competition
- Improved Memory Allocation: Each cluster segment could allocate sufficient memory resources without constraint
- Enhanced Query Performance: Database operations returned to normal execution timeframes under reduced resource pressure

### Preventive Measures

**Capacity Monitoring Enhancement**

We implemented enhanced monitoring to prevent similar incidents (see the illustrative sketch at the end of this update):

- Proactive Threshold Monitoring: Real-time tracking of cluster resource utilization metrics
- Automated Scaling Triggers: Early warning systems for approaching capacity limits
- Tenant Distribution Optimization: Improved algorithms for balanced tenant placement across available infrastructure

**Infrastructure Scaling Improvements**

The outage highlighted the need for more dynamic scaling capabilities:

- Elastic Cluster Management: Enhanced ability to provision additional nodes before capacity limits are reached
- Resource Pool Expansion: Increased default host group capacity to accommodate growth
- Performance Baseline Monitoring: Continuous tracking of key performance indicators to identify degradation trends

### Conclusion

The service outage was a direct result of infrastructure scaling limitations, primarily within the AS4556 cluster, where tenant growth exceeded the cluster's designed capacity. The combination of maximum node utilization, high CPU usage, and memory constraints created an environment in which normal Business Central operations could not function effectively. Our cluster splitting resolution successfully addressed the immediate capacity issues and restored service performance to acceptable levels.
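For readers who want a concrete picture of the proactive threshold monitoring described under Preventive Measures, here is a minimal sketch. It is our own illustration, not Microsoft's monitoring stack; the thresholds, metric values, and the `check_node` helper are hypothetical assumptions.

```python
# Minimal sketch of proactive threshold monitoring with early-warning
# triggers. Illustration only: the thresholds and sample metrics are
# hypothetical and do not describe Microsoft's actual monitoring.
from dataclasses import dataclass

# Warn well before hard capacity limits so nodes can be added in time.
CPU_WARN, CPU_CRIT = 0.80, 0.95   # fraction of CPU in use
MEM_WARN, MEM_CRIT = 0.75, 0.90   # fraction of memory in use

@dataclass
class NodeMetrics:
    node: str
    cpu: float     # 0.0 - 1.0
    memory: float  # 0.0 - 1.0

def check_node(m: NodeMetrics) -> str | None:
    """Return an alert message when a node approaches saturation."""
    if m.cpu >= CPU_CRIT or m.memory >= MEM_CRIT:
        return f"CRITICAL {m.node}: cpu={m.cpu:.0%} mem={m.memory:.0%} - trigger scale-out"
    if m.cpu >= CPU_WARN or m.memory >= MEM_WARN:
        return f"WARNING {m.node}: cpu={m.cpu:.0%} mem={m.memory:.0%} - provision capacity early"
    return None

if __name__ == "__main__":
    samples = [NodeMetrics("node-1", 0.97, 0.88),
               NodeMetrics("node-2", 0.82, 0.60),
               NodeMetrics("node-3", 0.40, 0.35)]
    for alert in filter(None, map(check_node, samples)):
        print(alert)
```

The two-tier thresholds reflect the idea in the RCA: the warning tier exists so additional nodes can be provisioned before a cluster ever reaches the saturation point that caused this outage.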