Ambra incident

InteleShare Incident

Ambra experienced a minor incident on August 29, 2024 affecting Web Services, Image Processing, and Image Viewing, lasting 1h 18m. The incident has been resolved; the full update timeline is below.

Started: Aug 29, 2024, 02:17 PM UTC
Resolved: Aug 29, 2024, 03:36 PM UTC
Duration: 1h 18m
Detected by Pingoru: Aug 29, 2024, 02:17 PM UTC

Affected components

Web Services
Image Processing
Image Viewing

Update timeline

investigating Aug 29, 2024, 02:17 PM UTC

We have received reports of issues on the InteleShare platform. Engineering teams are currently investigating. Additional information will be provided as soon as it is available.

identified Aug 29, 2024, 02:39 PM UTC

Our Engineering teams have identified a general instability in storage and are actively investigating and working to identify a fix. We understand the urgency and we appreciate your patience as we work to address the issue.

monitoring Aug 29, 2024, 02:56 PM UTC

Our Engineering team has applied updates to address the image viewing issues. You should now notice an improvement when viewing studies. The team is still monitoring the situation to ensure everything remains stable.

resolved Aug 29, 2024, 03:36 PM UTC

The incident has been resolved and service is back to normal levels. Our team will be conducting a root cause analysis and sharing it as soon as possible. We will continue to monitor the situation to ensure there are no further issues.

postmortem Dec 12, 2024, 10:49 PM UTC

Issue: Some study uploads stalled or failed, and some images already present on the platform could not be viewed.

Root Cause: We experienced an increase in error rates and determined that many of the InteleShare storage processing nodes were flapping between healthy and unhealthy status. As nodes were marked unhealthy, more traffic was routed to each remaining healthy node, overloading them and causing many of them to be marked unhealthy as well. The rapid cycling of nodes prevented our autoscaling from receiving accurate information, delaying automatic provisioning of additional nodes to handle the traffic. During this period of instability, requests to view or upload studies may have resulted in HTTP errors or timeouts.

Resolution: Some health check logic had changed in the latest InteleShare release. We reverted several of the settings to their previous values and also increased the health check timeouts. This reduced the flapping and allowed systems to remain online while autoscaling stabilized the load.
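
The overload cascade described in the root cause can be illustrated with a toy model. The sketch below is not InteleShare's health check logic: the function name, the node counts, and the load figures are all hypothetical, and "load a node tolerates before failing its check" is only a stand-in for a health check timeout. It does not model node recovery (flapping) or autoscaling. It shows only why losing a few nodes can tip the rest over, and why a more tolerant check can keep the same pool online.

```python
def simulate_cascade(total_load, node_count, initially_unhealthy, max_load_per_node):
    """Return how many nodes remain healthy once load redistribution settles.

    Hypothetical model: traffic is split evenly across healthy nodes, and a
    node fails its health check when its share exceeds max_load_per_node.
    """
    healthy = node_count - initially_unhealthy
    while healthy > 0:
        per_node = total_load / healthy
        if per_node <= max_load_per_node:
            return healthy  # remaining nodes can absorb the traffic
        healthy -= 1  # an overloaded node is marked unhealthy, raising load on the rest
    return 0


# Illustrative numbers only: 10 nodes sharing 900 units of traffic.
stable = simulate_cascade(900, 10, initially_unhealthy=1, max_load_per_node=100)
collapsed = simulate_cascade(900, 10, initially_unhealthy=2, max_load_per_node=100)
tolerant = simulate_cascade(900, 10, initially_unhealthy=2, max_load_per_node=120)
print(stable, collapsed, tolerant)
```

In this model, losing one node is survivable, losing two pushes every remaining node past its limit and the pool collapses, and raising the tolerance (loosely analogous to the longer health check timeouts mentioned in the resolution) lets the same eight nodes stay in service.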