CampBrain experienced a minor incident on September 24, 2024 affecting Office Portal, lasting 12h 12m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Sep 24, 2024, 03:28 PM UTC
We are currently experiencing some slowness in the office portal. We are investigating the issue.
- investigating Sep 24, 2024, 04:02 PM UTC
We are continuing to experience slowness within the office portal. Our engineers are investigating the problem and we will update this message as more information becomes available.
- investigating Sep 24, 2024, 04:40 PM UTC
We are continuing to experience slowness within the office portal. The issue is being caused by our hosting partner Microsoft Azure, and we are continuing to investigate the problem with their assistance.
- investigating Sep 24, 2024, 05:43 PM UTC
We are continuing to experience slowness within the office portal. The issue is being caused by our hosting partner, Microsoft Azure, and our engineers are actively working with them to find a solution.
- investigating Sep 24, 2024, 07:33 PM UTC
The slowness within the office portal continues. It is being caused by our hosting partner, Microsoft Azure. We are doing all we can to escalate the problem with Azure and to work with them on a solution.
- investigating Sep 24, 2024, 09:55 PM UTC
As you know, our hosting partner, Microsoft Azure, is experiencing an issue causing the office portal slowness. We have been in constant contact with them, and continue to collaborate on a solution. Meanwhile, we have engaged all our relevant resources to explore other options to expedite a fix.
- identified Sep 24, 2024, 11:00 PM UTC
Our engineers have continued to work with Microsoft Azure on resolving the slowness issue with CampBrain portals. We have identified a fix that will involve taking CampBrain offline between 11pm and 12am Eastern tonight. We sincerely apologize for any inconvenience this may cause; however, it is a step we must take in order to resolve the problem.
- resolved Sep 25, 2024, 03:41 AM UTC
This incident has been resolved.
- postmortem Oct 03, 2024, 12:08 AM UTC
**Summary**

On Tuesday, September 24th, our automatic alerting system notified us that some of the servers hosting our Office Portal application were not able to come online. This caused slower performance in the Office Portal from approximately 11am-11pm ET.

After conducting our internal investigation, we were unable to immediately identify the cause. We had to work closely with our cloud provider, Microsoft Azure, and the process took longer than expected. The delay in resolving the issue was partly due to the time it took for the provider to determine the root cause. Eventually, they identified that the problem was caused by a failure to download a health monitoring tool during server startup. Removing this step from the startup process resolved the issue.

**Moving Forward**

As a result of this incident, we are taking steps both to better prevent a problem like this from occurring in the future and to improve our resolution time for problems like this one:

* Create new alerts in our alerting system to make us aware of server startup failures, to help improve our response time
* Engage our cloud provider to identify whether there is a better process available for us to gain their assistance more quickly and effectively
* Engage our cloud provider to identify how we can be made aware of Microsoft tool end-of-life dates, so that we can remove old, near-retirement tools from our system before they become inaccessible to us and cause a problem. We will also add this information to our internal playbooks and knowledge bank.

We sincerely apologize for this multi-hour disruption on September 24th. We know that system reliability is the highest priority, and we will continue to ensure our infrastructure is solid so we can reduce risk to your operations. We are grateful for your understanding and patience.
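
For readers who run similar infrastructure, the failure mode in the postmortem above (a single download step blocking server startup) can be illustrated with a small sketch. This is not CampBrain's or Azure's actual startup code, and it is not the fix that was applied: the fix described above was to remove the step entirely. The sketch shows a different, hypothetical approach in which a non-essential step is marked optional, so its failure raises an alert without keeping the server offline. All names here (`StartupStep`, `run_startup`, `download_health_monitor`, `start_portal`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StartupStep:
    """One step in a hypothetical server startup sequence."""
    name: str
    action: Callable[[], None]
    required: bool = True  # optional steps should not keep the server offline


@dataclass
class StartupReport:
    online: bool = True
    alerts: List[str] = field(default_factory=list)


def run_startup(steps: List[StartupStep]) -> StartupReport:
    """Run steps in order; record an alert for every failure.

    A failed required step stops startup and marks the server offline.
    A failed optional step is recorded and startup continues.
    """
    report = StartupReport()
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            report.alerts.append(f"startup step '{step.name}' failed: {exc}")
            if step.required:
                report.online = False
                break
    return report


def download_health_monitor() -> None:
    # Hypothetical stand-in for a tool download that is no longer available.
    raise ConnectionError("download unavailable")


def start_portal() -> None:
    # Hypothetical stand-in for starting the application itself.
    pass


report = run_startup([
    StartupStep("download health monitoring tool", download_health_monitor, required=False),
    StartupStep("start Office Portal application", start_portal),
])
print(report.online)
print(report.alerts)
```

In this sketch the server still comes online when the download fails, and the recorded alert is the kind of startup-failure signal the first action item above describes.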