Ambra experienced a minor incident on November 29, 2023, affecting Web Services, Image Processing, and one other component. It lasted 2h 47m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Nov 29, 2023, 06:56 PM UTC
We have received reports of issues on the Ambra platform. Engineering teams are currently investigating. Additional information will be provided as soon as it is available.
- investigating Nov 29, 2023, 07:21 PM UTC
Our engineering teams have not yet identified the root cause. We are continuing to investigate and will provide further updates as soon as possible.
- monitoring Nov 29, 2023, 07:55 PM UTC
Our engineering teams have restarted storage nodes and applied configuration changes to address the ongoing issues. System performance is improving, and we will continue to monitor for further issues.
- resolved Nov 29, 2023, 09:44 PM UTC
The incident has been resolved. Our team will conduct a root cause analysis and share the findings as soon as possible. We will continue to monitor the situation to ensure there are no further issues.
- postmortem Dec 01, 2023, 10:02 PM UTC
Ambra storage began experiencing a higher than usual number of errors and timeouts, causing slow image viewing and/or gateway backlogs. We observed high network activity on our caching cluster, which caused network buffers to grow and led to rapid memory growth. To mitigate the issue, we modified cache configuration settings to limit buffer sizes and slow the memory growth. We also deployed additional storage nodes and implemented logic on our load balancers to reduce traffic to unhealthy nodes more quickly. To address the underlying problem of high network activity, an upcoming Ambra release will contain several optimizations that significantly reduce the overall network traffic required for cache lookups.
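For readers unfamiliar with the load balancer mitigation described above, the general idea of steering traffic away from unhealthy nodes can be sketched as passive health checking: count consecutive failed requests per node and stop routing to a node once it crosses a threshold. This is an illustrative sketch only, not Ambra's actual implementation; the names `Node`, `Balancer`, and `eject_after`, and the threshold value, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    consecutive_failures: int = 0


@dataclass
class Balancer:
    nodes: list
    eject_after: int = 3  # hypothetical threshold, chosen for illustration
    _next: int = 0        # round-robin cursor

    def record(self, name: str, ok: bool) -> None:
        """Update a node's failure streak from the outcome of one request."""
        for node in self.nodes:
            if node.name == name:
                # A success resets the streak; a failure extends it.
                node.consecutive_failures = (
                    0 if ok else node.consecutive_failures + 1
                )

    def healthy(self) -> list:
        """Nodes whose failure streak is still below the ejection threshold."""
        return [n for n in self.nodes if n.consecutive_failures < self.eject_after]

    def pick(self) -> str:
        """Round-robin over healthy nodes; fall back to all nodes if none are healthy."""
        pool = self.healthy() or self.nodes
        node = pool[self._next % len(pool)]
        self._next += 1
        return node.name


lb = Balancer([Node("storage-a"), Node("storage-b")])
for _ in range(3):
    lb.record("storage-b", ok=False)  # three timeouts in a row
print(lb.pick())  # only storage-a remains in rotation
```

A lower `eject_after` value removes a failing node from rotation sooner, at the cost of ejecting nodes on brief, transient errors. The fallback to all nodes is a fail-open choice so that requests are still attempted when every node looks unhealthy.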