TechnologyOne experienced a critical incident on April 30, 2024 affecting multiple components, including Batch Services (DP Jobs), lasting 6h 4m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Apr 30, 2024, 12:31 AM UTC
Our team of engineers has identified an issue with DP Job processing. We aim to provide you with an update in the next 60 minutes.
- monitoring Apr 30, 2024, 01:00 AM UTC
A fix was applied at approximately 10:30 AM BNE today, and CPU utilisation has dropped back to standard operating levels. We are currently monitoring DP job processing and will continue to monitor the stability of the services before resolving this incident. We sincerely apologise for the inconvenience this has caused you.
- resolved Apr 30, 2024, 06:35 AM UTC
The issue has been resolved. The incident originated from multiple processes unexpectedly overloading the SaaS environment management database system, resulting in high CPU utilization. The problem lasted for approximately one hour until it was resolved by terminating the processes, which returned CPU utilization to normal levels. Our engineering team has implemented preventative measures aimed at early detection of similar issues. These measures are designed to identify potential overloads before they can affect our customers, so that any future occurrences are addressed swiftly and with minimal disruption. We thank you for your patience and cooperation as we worked through resolving this incident.
- postmortem May 23, 2024, 12:06 AM UTC
**Issue Summary:** On Tuesday 30 April 2024 at 9:15 am, alert monitoring indicated that our cloud orchestration platform was saturated with long-running jobs. The TechnologyOne team began an investigation immediately. The impact was most noticeable as DP jobs queuing or unable to be submitted.

**Root Cause Analysis:** Queue and processing limits were reached due to long-running processes locking the database.

**Corrective Measures:** The long-running processes were terminated.

**Preventive Measures:**
* Additional monitoring for load caused by long-running jobs.
* The standard operating procedure has been updated to improve investigation and resolution of this and similar incidents.
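
As an illustration only, the first preventive measure (monitoring for load caused by long-running jobs) could be sketched along the following lines. This is not TechnologyOne's implementation: the `Job` record, the function names, and the thresholds are all hypothetical, and a real monitor would read job state from the scheduler or database rather than from an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Job:
    """A running batch job (hypothetical shape, not a TechnologyOne API)."""
    job_id: str
    started_at: datetime


def find_long_running(jobs, now, max_runtime):
    """Return jobs that have run longer than max_runtime, longest-running first."""
    overdue = [job for job in jobs if now - job.started_at > max_runtime]
    # An earlier start time means a longer elapsed runtime.
    return sorted(overdue, key=lambda job: job.started_at)


def should_alert(jobs, now, max_runtime, max_overdue):
    """Signal an alert once more than max_overdue jobs exceed max_runtime."""
    return len(find_long_running(jobs, now, max_runtime)) > max_overdue


# Example with made-up jobs and thresholds (assumptions for illustration).
now = datetime(2024, 4, 30, 0, 30, tzinfo=timezone.utc)
jobs = [
    Job("dp-a", now - timedelta(minutes=5)),
    Job("dp-b", now - timedelta(hours=3)),
    Job("dp-c", now - timedelta(hours=2)),
]
overdue_ids = [j.job_id for j in find_long_running(jobs, now, timedelta(hours=1))]
alert = should_alert(jobs, now, timedelta(hours=1), max_overdue=1)
```

Alerting on the count of overdue jobs, rather than on CPU utilisation alone, is one way to catch saturation before queue and processing limits are reached; suitable thresholds would depend on normal job durations in the environment.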