Ambra incident

Ambra experienced a minor incident on September 12, 2023, affecting Web Services, Image Processing, and Image Viewing, lasting 10h 7m. The incident has been resolved; the full update timeline is below.

Started: Sep 12, 2023, 06:52 PM UTC
Resolved: Sep 13, 2023, 05:00 AM UTC
Duration: 10h 7m
Detected by Pingoru: Sep 12, 2023, 06:52 PM UTC

Affected components

Web Services, Image Processing, Image Viewing

Update timeline

investigating Sep 12, 2023, 03:52 PM UTC

We have received reports of issues on the Ambra platform. Engineering teams are currently investigating. Additional information will be provided as soon as it is available.
identified Sep 12, 2023, 04:07 PM UTC

Our engineering teams have identified and are working on remediating performance issues with our database that are causing issues with uploading via the web and gateway. Further updates will be provided here as they are made available.
identified Sep 12, 2023, 04:23 PM UTC

Our Engineering teams are continuing to work towards a resolution. We understand the urgency and we appreciate your patience as we work to address the issue.
identified Sep 12, 2023, 04:40 PM UTC

Our engineering teams continue working on a resolution to the image uploading issues we are experiencing and we will provide further updates as they are available.
identified Sep 12, 2023, 04:53 PM UTC

At this time there are no new developments. Engineering teams are still working towards a solution.
identified Sep 12, 2023, 05:12 PM UTC

Our engineering teams are diligently investigating potential solutions to address the image uploading issues. Once we have successfully identified a solution, we will swiftly implement it and keep you informed with regular updates.
identified Sep 12, 2023, 05:31 PM UTC

Engineering teams are implementing and testing possible solutions to our database issue and though a definitive resolution has not yet been reached, the team continues their efforts. We appreciate your patience.
identified Sep 12, 2023, 05:50 PM UTC

At this time, the engineering team has completed an initial fix attempt, which unfortunately did not yield the desired results. They are now proceeding to implement the next potential solution and we will continue to provide updates on the progress.
identified Sep 12, 2023, 06:03 PM UTC

Our team is actively engaged in the deployment of a new storage cluster to optimize database load distribution. Currently, we do not have a specific estimated time for completion, but we have allocated multiple dedicated resources to expedite the restoration of service.
identified Sep 12, 2023, 06:25 PM UTC

The team is still engaged in the deployment of the new storage cluster at this time and will provide an ETA as soon as possible.
identified Sep 12, 2023, 06:52 PM UTC

The team is still progressing through the necessary steps for deploying the new storage cluster at this time. The storage cluster should be running in approximately 30 minutes.
identified Sep 12, 2023, 07:29 PM UTC

The team is making the final preparations to bring the new storage node up and running at this time.
monitoring Sep 12, 2023, 07:51 PM UTC

The Engineering team has successfully implemented the new storage node, and we are actively monitoring the status of image processing at this time.
monitoring Sep 12, 2023, 08:11 PM UTC

Currently, we are observing the ingestion of images on the new storage node. However, studies which were partially uploaded during these issues may still be attempting to connect with the old storage node. Our engineering team is actively monitoring the situation and exploring further potential solutions.
monitoring Sep 12, 2023, 08:40 PM UTC

The new storage cluster (storelpa.dicomgrid.com) is performing well. However, studies which were partially uploaded during these issues may still be attempting to connect with the old storage node. Our engineering team is actively monitoring the situation and exploring further potential solutions.
monitoring Sep 12, 2023, 09:17 PM UTC

The new storage cluster (storelpa.dicomgrid.com) continues to perform well. During the day the gateways have developed a backlog. We do not yet have an ETA for when the gateway backlogs will be resolved.
monitoring Sep 12, 2023, 09:54 PM UTC

All storage clusters are healthy, and study ingestion continues. We estimate 4-6 hours to process the study backlog. Progress updates will be posted hourly.
monitoring Sep 12, 2023, 10:47 PM UTC

All storage clusters are healthy, and study ingestion continues. Currently, we estimate less than 4 hours to finish processing the study backlog. Progress updates will be posted hourly.
monitoring Sep 12, 2023, 11:44 PM UTC

The study backlog is reducing quickly. We are now estimating less than 2 hours to finish processing the backlog.
monitoring Sep 13, 2023, 12:48 AM UTC

Processing of the study backlog continues. Currently, we estimate 1-2 hours to get back to normal processing.
monitoring Sep 13, 2023, 01:48 AM UTC

The study backlog continues to decline. Currently, we estimate approximately 1 hour to get back to normal processing.
monitoring Sep 13, 2023, 02:52 AM UTC

The study backlog continues to decline, though not as quickly as previously anticipated. Hourly updates will continue until we are back to normal processing.
monitoring Sep 13, 2023, 03:54 AM UTC

The study backlog continues to decline. Hourly updates will continue until we are back to normal processing.
resolved Sep 13, 2023, 05:00 AM UTC

The incident has been fully resolved and service is back to normal levels. Our team will be conducting a root cause analysis and sharing it as soon as possible. We will continue to monitor the situation to ensure there are no further issues.
postmortem Sep 18, 2023, 04:23 PM UTC

Root Cause: The database load increased due to an abnormally high number of database locks, which started generating errors on some queries related to study uploads. These locks were caused by database optimization work that hit backend limits and held locks on the system for longer than expected.

Remediation: The following actions were performed:
• Increased the existing database cluster resources.
• Created a secondary database cluster to receive new study uploads (ingestion).
• Removed a newly introduced configuration, part of the database optimization work, that was causing the increased locks. The scaling above helped immediately with newly sent studies, but pending studies with failures could not be processed until this configuration was removed. Its removal does not impact the end benefit.