Ambra incident
July 2023: Ambra service degradation on Web Services
Ambra experienced a minor incident on July 6, 2023, affecting Web Services, Image Processing, and one other component. It lasted 51 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jul 06, 2023, 04:56 PM UTC
Our engineering teams are actively investigating and working to identify the root cause. We understand the urgency and appreciate your patience as we work to address the issue.
- investigating Jul 06, 2023, 05:32 PM UTC
Our engineering teams have not yet identified the root cause. To mitigate the issue we have restarted transcoding nodes and we are seeing improved viewing speeds. We continue to investigate the issue.
- investigating Jul 06, 2023, 05:46 PM UTC
We are continuing to investigate this issue.
- resolved Jul 06, 2023, 05:47 PM UTC
The incident has been resolved and service is back to normal levels. Our team will conduct a root cause analysis and share the findings as soon as possible. We will continue to monitor the situation to ensure there are no further issues.
- postmortem Jul 07, 2023, 10:37 PM UTC
Several complex studies were being viewed at once, which overloaded Ambra's transcoding servers and led to response times 10 to 20 times as long as normal. These slow response times caused users to refresh web pages or repeat their requests, further overloading the servers. Because thread pools and other resources are shared, the slow transcoding responses also delayed some storage requests that did not require transcoding. To mitigate the immediate issue, Ambra engineers restarted all transcoding services to clear any hung processes or threads. They also deployed additional transcoding nodes to better handle spikes in transcoding activity. We have identified some of the studies that caused the transcoding slowdowns, and our development team will investigate ways to improve performance for these types of studies.
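
The effect of shared thread pools described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Ambra's implementation: the function `finish_times`, the worker counts, and the request durations are all hypothetical, chosen to show how fast storage requests queued behind slow transcoding requests inherit their delay, and how reserving a worker for storage (a "bulkhead") could avoid that.

```python
import heapq


def finish_times(durations_ms, workers):
    """Simulate a FIFO worker pool where every job arrives at t=0.

    Each job is taken, in order, by whichever worker frees up first.
    Returns the finish time (ms) of each job, in submission order.
    """
    free_at = [0] * workers          # time at which each worker is next free
    heapq.heapify(free_at)
    finished = []
    for duration in durations_ms:
        start = heapq.heappop(free_at)   # earliest available worker
        end = start + duration
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


# Hypothetical workload: 8 slow transcoding requests queued ahead of
# 4 fast storage requests. All numbers are illustrative assumptions.
TRANSCODE = [2000] * 8   # ms per transcoding request
STORAGE = [100] * 4      # ms per storage request

# Shared pool: 4 workers serve both request types from one queue.
shared = finish_times(TRANSCODE + STORAGE, workers=4)
shared_transcode_worst = max(shared[:len(TRANSCODE)])
shared_storage_worst = max(shared[len(TRANSCODE):])

# Isolated pools: 3 workers for transcoding, 1 reserved for storage.
isolated_transcode_worst = max(finish_times(TRANSCODE, workers=3))
isolated_storage_worst = max(finish_times(STORAGE, workers=1))

print("shared pool, slowest storage request (ms):", shared_storage_worst)
print("isolated pool, slowest storage request (ms):", isolated_storage_worst)
```

With these assumed numbers, the slowest storage request finishes after 4100 ms in the shared pool but after 400 ms with a reserved worker. The trade-off is that transcoding has one fewer worker, so its slowest request takes 6000 ms instead of 4000 ms. Isolation protects unrelated requests but does not add transcoding capacity.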