Core Practice incident

Elevated level of application errors

Core Practice experienced a critical incident on August 28, 2024, affecting the Core Practice Application and lasting 2h 36m. The incident has been resolved; the full update timeline is below.

Started: Aug 28, 2024, 08:27 AM UTC
Resolved: Aug 28, 2024, 11:04 AM UTC
Duration: 2h 36m
Detected by Pingoru: Aug 28, 2024, 08:27 AM UTC

Affected components

Core Practice Application

Update timeline

investigating Aug 28, 2024, 08:27 AM UTC

We're experiencing an elevated level of application errors and are currently investigating the issue.

identified Aug 28, 2024, 08:48 AM UTC

We have identified the affected servers and rerouted all traffic to unaffected servers. We are currently investigating the root cause.

identified Aug 28, 2024, 08:54 AM UTC

We are continuing to work on a fix for this issue.

monitoring Aug 28, 2024, 10:42 AM UTC

A fix has been implemented and we are monitoring the results.

resolved Aug 28, 2024, 11:04 AM UTC

Summary of Impact: On 28 August 2024, between 6:10 PM and 6:40 PM (AEDT), a significant number of users intermittently encountered application errors that prevented them from accessing the Core Practice system.

Root Cause: Our investigation found that a small number of dependency requests were returning 500 errors, which prevented the Core Practice application from loading successfully. The recent migration of Core Practice to a new infrastructure architecture exacerbated the issue and made it take longer than expected to diagnose. Some server pools were functioning as expected, so we rerouted all traffic to these unaffected pools. Further investigation showed that one of the servers in the affected pool had failed to load, and we rebuilt that server.

Mitigation: The Core Practice team is committed to improving its infrastructure through the new server migration, which is intended to enhance security and redundancy by routing traffic to healthy pools. We are currently improving our emergency procedures to reduce switchover downtime, and we are reviewing our processes so that repairs can be automated without manual intervention.

We apologize for the inconvenience caused and thank you for your cooperation.
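For readers curious what "routing traffic to healthy pools" might look like, the following is a minimal sketch only, not Core Practice's actual implementation. It assumes each server pool exposes the recent status codes of its dependency requests, and it treats a pool as healthy when its share of 5xx responses stays under a threshold. The names `Pool`, `error_rate`, and `routing_weights`, and the 1% threshold, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Pool:
    """Hypothetical server pool with its most recent response status codes."""
    name: str
    recent_statuses: List[int]


def error_rate(pool: Pool) -> float:
    """Return the fraction of recent responses that were 5xx server errors."""
    if not pool.recent_statuses:
        return 0.0
    errors = sum(1 for status in pool.recent_statuses if 500 <= status <= 599)
    return errors / len(pool.recent_statuses)


def routing_weights(pools: List[Pool], max_error_rate: float = 0.01) -> Dict[str, float]:
    """Split traffic evenly across healthy pools and send none to unhealthy ones.

    The threshold is an assumed value. If no pool is healthy, traffic is
    split across all pools rather than dropped entirely.
    """
    if not pools:
        return {}
    healthy = {p.name for p in pools if error_rate(p) <= max_error_rate}
    if not healthy:
        healthy = {p.name for p in pools}
    share = 1.0 / len(healthy)
    return {p.name: (share if p.name in healthy else 0.0) for p in pools}


if __name__ == "__main__":
    pools = [
        Pool("pool-a", [200] * 10),
        Pool("pool-b", [200] * 8 + [500] * 2),  # 20% of requests failing
        Pool("pool-c", [200] * 10),
    ]
    print(routing_weights(pools))
```

In this sketch the pool returning 500 errors receives a weight of zero, so all traffic goes to the remaining pools. Running a check like this automatically is one way the manual switchover described above could be shortened.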