Products Up incident

Failure of Scheduled Database Job and Subsequent Job Processing Interruptions

Products Up experienced a critical incident on May 1, 2024. The incident has been resolved; the full update timeline is below.

Started: May 01, 2024, 07:01 AM UTC
Resolved: Apr 30, 2024, 10:00 PM UTC
Duration: —
Detected by Pingoru: May 01, 2024, 07:01 AM UTC

Update timeline

resolved May 01, 2024, 07:01 AM UTC

A critical failure occurred in the scheduled job designed to add partitions to several tables within our database. This scheduled maintenance task did not execute as planned, directly impacting all processing jobs that rely on these database tables. Consequently, all processing jobs scheduled between 00:00 and 06:00 UTC+2 were unable to run, leading to significant delays and disruption of normal operations.

Cause of Incident

The primary cause of the incident was the failure of the scheduled job tasked with adding partitions to the database tables. A deeper investigation revealed that the failure occurred due to an unforeseen software issue with our scheduler server, which also prevented the execution of our fail-safe mechanism designed to handle such failures.

Operational Impact

The processing of all jobs scheduled during the affected time frame was halted, causing a backlog of data processing tasks.

Immediate Actions Taken

Manual intervention was employed to run the necessary jobs once the issue was identified. Software diagnostics and immediate repairs were initiated on the scheduler server to restore its functionality.

Long-Term Corrective Actions

Hardware Redundancy: Implement additional hardware redundancy for our scheduler server to mitigate the risk of a single point of failure.

Enhanced Monitoring: Upgrade our monitoring systems to detect software issues more promptly before they impact critical operations.

Fail-Safe Enhancements: Review and enhance the robustness of our fail-safe mechanisms to ensure they can handle unexpected failures more effectively.

Regular Audits: Schedule regular audits of scheduled tasks and associated hardware to ensure they are functioning as expected without any potential risks.

Conclusion

We sincerely apologize for the inconvenience caused by this incident and are committed to implementing the necessary measures to prevent such occurrences in the future. We appreciate the understanding and patience of all stakeholders during the resolution of this issue.
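
Illustrative note on the monitoring and fail-safe items: the report says the fail-safe did not run because it depended on the same scheduler server as the partition job. One possible mitigation is a check that runs from a separate host and verifies that the partitions for the coming days already exist. The Python sketch below is a minimal illustration under stated assumptions, not the actual Products Up implementation: it assumes daily partitions keyed by date, and the function names, the look-ahead window, and the alert text are all hypothetical.

```python
from datetime import date, timedelta


def missing_partitions(existing, today, days_ahead=3):
    """Return the dates from today through today + days_ahead with no partition.

    `existing` is a set of dates that already have a partition. Daily,
    date-keyed partitions are an assumption; the report does not describe
    the actual partitioning scheme.
    """
    wanted = [today + timedelta(days=offset) for offset in range(days_ahead + 1)]
    return [day for day in wanted if day not in existing]


def check(existing, today, days_ahead=3):
    """Return an alert message, or None when every expected partition exists."""
    gaps = missing_partitions(existing, today, days_ahead)
    if not gaps:
        return None
    listed = ", ".join(day.isoformat() for day in gaps)
    return f"ALERT: missing partitions for {listed}"
```

In practice the set of existing partition dates would be read from the database catalog, and the check would be triggered by something other than the scheduler server it is meant to watch, so that a single scheduler fault cannot silence both the job and its safeguard.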