Limble CMMS incident

Issues with creating Tasks, Parts, Assets

Limble CMMS experienced a critical incident on November 12, 2024, affecting the Limble CMMS Web Application and the Limble CMMS API and lasting 1h 47m. The incident has been resolved; the full update timeline is below.

Started: Nov 12, 2024, 08:11 PM UTC
Resolved: Nov 12, 2024, 09:59 PM UTC
Duration: 1h 47m
Detected by Pingoru: Nov 12, 2024, 08:11 PM UTC

Affected components

Limble CMMS Web Application, Limble CMMS API

Update timeline

investigating Nov 12, 2024, 08:11 PM UTC

We are currently investigating this issue.
investigating Nov 12, 2024, 08:13 PM UTC

We are continuing to investigate this issue.
investigating Nov 12, 2024, 09:10 PM UTC

We have identified the issue and have taken steps toward a fix.
monitoring Nov 12, 2024, 09:31 PM UTC

A fix has been implemented and we are monitoring the results.
resolved Nov 12, 2024, 09:59 PM UTC

This incident is now resolved.
postmortem Nov 14, 2024, 08:12 PM UTC

**Date:** November 12, 2024

**Status:** Resolved

## Summary

On November 12, 2024, the application experienced a rolling service disruption that intermittently affected customers attempting to create new items in our application, such as tasks, assets, and parts. The incident was caused by a database migration deployed to production, which overloaded our databases and caused delays in propagating requests. Immediate action was taken to terminate the migration and restore service to customers. To prevent a recurrence, new deploy procedures and monitors are being implemented.

## Impact

For a period of approximately 3 hours, customers intermittently encountered delays or failures when attempting to create items such as tasks, assets, and parts in our application. In some cases the action appeared to fail, but the items were successfully created and appeared in the application after some delay.

## Root Cause

The incident was caused by a database migration that overloaded our production database. The overload led directly to an increase in replication lag. When this metric exceeded 1 second, our applications' workflows began failing or timing out.

## Resolution and Improvements

Once discovered, the offending database migration was immediately terminated, restoring service to all customers. Our engineers then corrected the migration, tested it thoroughly using improved protocols, and re-executed it without a recurrence of the service disruption.

Additionally, the following improvements will be implemented:

* Monitoring and alerting improvements
* Stricter testing requirements for all migrations, using a near-production sandbox database
* High-risk migrations will be executed during planned maintenance windows

### Timeline of Events

* 11:29 AM MST: Database migration initiated.
* 12:03 PM MST: Customers begin reporting disruptions.
* 12:30 PM MST: Investigation and communication initiated.
* 12:57 PM MST: Incident is declared.
* 2:15 PM MST: Root cause and solution identified.
* 2:22 PM MST: Solution implemented and verified in production.
* 2:37 PM MST: Incident resolved following further monitoring.

### Key Points

* No loss of our customers' historical data.
* Not all customers were impacted at the same time. This was a rolling disruption.
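
The root cause above ties the failures to replication lag rising past 1 second while the migration ran. The postmortem does not say how the migration was corrected, so the following is only a minimal sketch of one common mitigation: applying a migration in small batches and pausing whenever lag crosses the threshold. Every name here (`run_throttled_migration`, `get_replication_lag`, the batch callables, the pause settings) is hypothetical and not taken from Limble's systems; only the 1-second threshold comes from the postmortem.

```python
import time
from typing import Callable, Iterable

# From the postmortem: workflows began failing once lag exceeded 1 second.
LAG_THRESHOLD_SECONDS = 1.0


def run_throttled_migration(
    batches: Iterable[Callable[[], None]],
    get_replication_lag: Callable[[], float],
    lag_threshold: float = LAG_THRESHOLD_SECONDS,
    pause_seconds: float = 5.0,
    max_pauses_per_batch: int = 12,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply migration batches, pausing while replication lag is too high.

    `batches` and `get_replication_lag` are hypothetical callables supplied
    by the caller; how lag is measured depends on the database in use.
    Returns the number of batches applied. Raises RuntimeError if lag stays
    above the threshold for too long, so an operator can abort the run.
    """
    applied = 0
    for batch in batches:
        pauses = 0
        # Wait for replicas to catch up before adding more write load.
        while get_replication_lag() > lag_threshold:
            if pauses >= max_pauses_per_batch:
                raise RuntimeError(
                    f"replication lag stayed above {lag_threshold}s; "
                    f"aborting after {applied} batches"
                )
            sleep(pause_seconds)
            pauses += 1
        batch()
        applied += 1
    return applied
```

Injecting `sleep` and `get_replication_lag` keeps the sketch independent of any particular database driver and makes it easy to exercise against a staging copy before a production run.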