Provation Software incident

Partial outage - errors saving while in procedure documentation

Provation Software experienced a minor incident on March 4, 2024 affecting Provation Apex, lasting 1h 56m. The incident has been resolved; the full update timeline is below.

Started: Mar 04, 2024, 05:38 PM UTC
Resolved: Mar 04, 2024, 07:35 PM UTC
Duration: 1h 56m
Detected by Pingoru: Mar 04, 2024, 05:38 PM UTC

Affected components

Provation Apex

Update timeline

investigating Mar 04, 2024, 05:38 PM UTC

Currently investigating

investigating Mar 04, 2024, 05:43 PM UTC

We are continuing to investigate the issue.

investigating Mar 04, 2024, 06:39 PM UTC

Apex fully functional.

resolved Mar 04, 2024, 07:35 PM UTC

This incident has been resolved.

postmortem Mar 21, 2024, 10:58 PM UTC

**Postmortem: Sporadic Error Saving Notes & Printing Issues**

**Incident Summary**

On March 4th at 11:37 CST, Apex customers began experiencing sporadic errors when saving notes, along with printing issues. Investigation revealed that **3 out of 4 Apex instances were not processing larger payload traffic successfully**. All Apex instances were cleared, and the issue was resolved at 13:35 CST.

**Root Cause**

The root cause of the issue was a **lack of available disk space on certain Apex instances**.

**Detailed Analysis**

1. **Disk Space Shortage**:
   * The lack of available disk space was identified as the primary issue.
   * Apex instances were unable to process larger payloads due to insufficient disk space.
   * This impacted overall system performance and caused sporadic errors for users.
2. **Excessive Log Files**:
   * Further investigation revealed that log files were consuming a significant amount of disk space.
   * These log files were not being deleted frequently enough, leading to the disk space shortage.
   * Increasing Apex traffic contributed to the accumulation of log files.
3. **Log File Management**:
   * The team had not adjusted the log file deletion frequency to match the increased Apex traffic.
   * As a result, log files were not being purged at an acceptable rate.
   * No alerting mechanism existed to warn the team that disk space was running low.

**Corrective Actions**

1. **Immediate Disk Space Cleanup**:
   * The team performed an emergency cleanup to free up disk space on the affected Apex instances.
   * Old log files were removed to alleviate the shortage.
2. **Log Rotation and Deletion Strategy**:
   * A log rotation and deletion strategy was implemented.
   * Log files are now rotated and deleted at regular intervals based on traffic patterns.
   * The deletion frequency is adjusted dynamically to accommodate increased traffic.
3. **Alerting System Enhancement**:
   * An alerting system was set up to notify the team when disk space reaches critical levels.
   * Alerts are triggered based on predefined thresholds to prevent future incidents.

**Preventive Measures**

1. **Capacity Planning**:
   * Regular capacity planning exercises will be conducted to anticipate resource needs.
   * Disk space requirements will be reviewed and adjusted as necessary.
2. **Automated Log Management**:
   * Explore automated log management tools to ensure timely deletion and rotation.
   * Regularly monitor log file sizes and adjust retention policies accordingly.
3. **Documentation and Training**:
   * Document the log management process and educate team members.
   * Ensure everyone understands the importance of disk space management.
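
**Illustrative Sketch**

The log deletion and disk space alerting described under Corrective Actions could look roughly like the following. This is a minimal sketch and not Provation's actual tooling: the function names, the `*.log` file pattern, the 80% and 90% thresholds, and the age-based retention rule are all assumptions chosen for illustration.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def disk_usage_percent(path: str) -> float:
    """Return the percentage of space used on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def alert_level(used_percent: float,
                warn_at: float = 80.0,
                critical_at: float = 90.0) -> str:
    """Map a usage percentage to an alert level (thresholds are assumed values)."""
    if used_percent >= critical_at:
        return "critical"
    if used_percent >= warn_at:
        return "warning"
    return "ok"


def purge_old_logs(log_dir: str,
                   max_age_days: float,
                   now: Optional[float] = None) -> List[str]:
    """Delete *.log files older than `max_age_days`; return the deleted file names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    deleted = []
    for entry in sorted(Path(log_dir).glob("*.log")):
        # Compare the last-modified time against the retention cutoff.
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            deleted.append(entry.name)
    return deleted


if __name__ == "__main__":
    # Read-only check of the current directory's filesystem; nothing is deleted here.
    used = disk_usage_percent(".")
    print(f"disk usage: {used:.1f}% -> {alert_level(used)}")
```

In a setup like this, a scheduled job could call `purge_old_logs` with a shorter `max_age_days` whenever `alert_level` reports `warning`, which is one simple way to tie deletion frequency to observed disk pressure.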