Ambra incident

Ambra experienced a major incident on April 30, 2024 that affected Web Services, Image Processing, and Image Viewing and lasted about 40 minutes. The incident has been resolved; the full update timeline is below.

Started: Apr 30, 2024, 07:21 PM UTC
Resolved: Apr 30, 2024, 08:02 PM UTC
Duration: 40m
Detected by Pingoru: Apr 30, 2024, 07:21 PM UTC
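
The duration can be cross-checked from the Started and Resolved timestamps above. The sketch below is only an illustration: the helper name `parse_utc` and the format string are assumptions, not part of any Ambra or Pingoru tooling. At minute precision the two timestamps are 41 minutes apart; the reported 40m presumably reflects second-level timing or rounding that the page does not show.

```python
from datetime import datetime, timezone

# Format of the timestamps as printed on this page, minus the trailing " UTC".
FMT = "%b %d, %Y, %I:%M %p"


def parse_utc(stamp: str) -> datetime:
    """Parse a timestamp such as 'Apr 30, 2024, 07:21 PM UTC' (hypothetical helper)."""
    text = stamp.removesuffix(" UTC")  # requires Python 3.9+
    return datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)


started = parse_utc("Apr 30, 2024, 07:21 PM UTC")
resolved = parse_utc("Apr 30, 2024, 08:02 PM UTC")

# Whole minutes between the two published timestamps.
elapsed_minutes = int((resolved - started).total_seconds() // 60)
print(elapsed_minutes)
```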

Affected components

Web Services, Image Processing, Image Viewing

Update timeline

investigating Apr 30, 2024, 07:21 PM UTC

We have received reports of slowness on the Ambra platform. Engineering teams are currently investigating. Additional information will be provided as soon as it is available.

investigating Apr 30, 2024, 07:35 PM UTC

Our Engineering teams are restarting storage containers and switching to network-optimized nodes. When this is complete, we anticipate improved system performance. More information is forthcoming.

resolved Apr 30, 2024, 08:02 PM UTC

The incident has been fully resolved and service is back to normal levels. Our team will conduct a root cause analysis and share it as soon as possible. We will continue to monitor the situation to ensure there are no further issues.

postmortem May 08, 2024, 03:26 PM UTC

After a thorough investigation, we have identified that the root cause was related to network connectivity issues, specifically dropped network packets that were meant for the storage component of our service. This led to a delay in data processing, which in turn caused a significant strain on our system's resources.

To address the immediate impact and restore service functionality, our team performed a series of pod restarts. This action effectively cleared the processing backlog and stabilized the affected system components.

To prevent such occurrences in the future and to enhance our service's resilience, we have taken two key steps:

1. We have transitioned to network-optimized nodes, which provide a substantial increase in network throughput, ensuring that data flows smoothly and efficiently to all parts of our application.
2. We have expanded the local storage capacity for our storage pods. This additional space serves as a buffer to accommodate any unexpected surges in data, thereby preventing potential backlogs from affecting our service performance.

We are committed to providing a reliable and high-performing service, and these improvements are part of our ongoing efforts to optimize our infrastructure. We apologize for any inconvenience this may have caused and appreciate your understanding.
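
The reasoning in the postmortem (reduced delivery to storage builds a backlog, while more throughput and a larger local buffer give headroom) can be illustrated with a toy queue model. This is a minimal sketch, not Ambra's system: the function `simulate_backlog` and every rate and capacity below are hypothetical values chosen only to show the shape of the effect.

```python
def simulate_backlog(arrivals_per_tick, drained_per_tick, buffer_capacity, ticks):
    """Toy model: return (backlog history, first tick the buffer is exceeded or None).

    All parameters are hypothetical units of work per time step.
    """
    backlog = 0
    history = []
    overflow_tick = None
    for tick in range(ticks):
        # Work arrives, the storage side drains what it can, backlog never goes negative.
        backlog = max(0, backlog + arrivals_per_tick - drained_per_tick)
        if overflow_tick is None and backlog > buffer_capacity:
            overflow_tick = tick
        history.append(backlog)
    return history, overflow_tick


# Healthy network (assumed numbers): drain rate exceeds arrivals, no backlog forms.
healthy_history, healthy_overflow = simulate_backlog(100, 120, 200, 30)

# Degraded network (assumed numbers): dropped packets cut the effective drain rate.
_, small_buffer_overflow = simulate_backlog(100, 80, 200, 30)

# Same degradation with a larger local buffer: overflow happens later.
_, large_buffer_overflow = simulate_backlog(100, 80, 400, 30)

print(healthy_overflow, small_buffer_overflow, large_buffer_overflow)
```

In this model a larger buffer only delays overflow while the drain rate stays below the arrival rate; restoring throughput is what lets the backlog stop growing, which is consistent with the two remediation steps being applied together.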