Ambra incident

InteleShare Incident

Ambra experienced a major incident on December 6, 2024, affecting Web Services, Image Processing, and Image Viewing, and lasting 3h 15m. The incident has been resolved; the full update timeline is below.

Started: Dec 06, 2024, 01:33 PM UTC
Resolved: Dec 06, 2024, 04:48 PM UTC
Duration: 3h 15m
Detected by Pingoru: Dec 06, 2024, 01:33 PM UTC

Affected components

Web Services, Image Processing, Image Viewing

Update timeline

investigating Dec 06, 2024, 01:33 PM UTC

We have received reports of issues on the InteleShare platform. Engineering teams are currently investigating. Additional information will be provided as soon as it is available.
investigating Dec 06, 2024, 02:08 PM UTC

Our engineering teams have not yet identified the root cause. We are continuing to investigate and will provide further updates as soon as possible.
monitoring Dec 06, 2024, 02:31 PM UTC

Our Engineering team has identified an issue with our storage nodes that caused problems loading and viewing images. These nodes have been restarted and should now be operational; however, there will be a backlog while they catch up, which may result in some residual slowness. We will continue to monitor the situation.
monitoring Dec 06, 2024, 03:20 PM UTC

We are continuing to monitor for any further issues; however, studies should now be viewable without issue. Residual slowness may occur as we work through our backlog.
resolved Dec 06, 2024, 04:48 PM UTC

The incident has been fully resolved and service is back to normal levels. Our team will conduct a root cause analysis and share the results as soon as possible. We will continue to monitor the situation to ensure there are no further issues.
postmortem Dec 12, 2024, 10:20 PM UTC

Issue:

After the recent InteleShare release, a small percentage of transcoding requests began failing intermittently in a way that caused the thread handling the request to enter an infinite loop, waiting for additional data that would never arrive. As these stuck threads accumulated, their combined resource usage led first to performance degradation and finally to errors or timeouts once the maximum thread pool limit was reached.

Root Cause:

The recent release of InteleShare included updates to a client library used for internal network communication. The new library improved overall performance but had different timeout behavior, which could sometimes cause slow connections to be closed without the error being passed through to other components.

Resolution:

We have adjusted our configuration settings so that the new library behaves similarly to the previous library, and the system is now stable.
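
The failure mode described in the postmortem (a worker waiting indefinitely for data that never arrives) is commonly guarded against by giving every read an overall deadline and by treating an early close as an error. The following is only an illustrative Python sketch under those assumptions, not InteleShare's actual code or configuration; `read_exactly`, `IncompleteRead`, and `deadline_s` are hypothetical names introduced for the example.

```python
import socket
import time


class IncompleteRead(ConnectionError):
    """Hypothetical error: the peer closed before the full payload arrived."""


def read_exactly(sock: socket.socket, n: int, deadline_s: float) -> bytes:
    """Read exactly n bytes, or raise instead of waiting indefinitely."""
    end = time.monotonic() + deadline_s
    chunks = []
    remaining = n
    while remaining > 0:
        left = end - time.monotonic()
        if left <= 0:
            raise TimeoutError(f"{remaining} bytes still missing at deadline")
        # Bound each recv by the time left in the overall deadline.
        sock.settimeout(left)
        try:
            chunk = sock.recv(remaining)
        except socket.timeout:
            raise TimeoutError(f"{remaining} bytes still missing at deadline") from None
        if not chunk:
            # An empty read means the peer closed the connection; surface it
            # to the caller rather than looping in the hope of more data.
            raise IncompleteRead(f"connection closed with {remaining} bytes missing")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

With a guard like this, a stalled or silently closed connection releases its worker thread with an error the caller can handle, so stuck threads cannot accumulate until the pool limit is reached.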