Ambra incident

Major · Resolved

Ambra experienced a major incident on February 19, 2024 that lasted 8h 4m and affected Web Services, Image Processing, and Image Viewing. The incident has been resolved; the full update timeline is below.

Started
Feb 19, 2024, 02:57 PM UTC
Resolved
Feb 19, 2024, 11:01 PM UTC
Duration
8h 4m
Detected by Pingoru
Feb 19, 2024, 02:57 PM UTC

Affected components

Web Services, Image Processing, Image Viewing

Update timeline

  1. investigating Feb 19, 2024, 02:57 PM UTC

    We have received reports of issues and slowness on the Ambra platform. Engineering teams are currently investigating. Additional information will be provided as soon as it is available.

  2. investigating Feb 19, 2024, 03:37 PM UTC

    Our engineering teams have not yet identified the root cause. We are continuing to investigate and will provide further updates as soon as possible.

  3. identified Feb 19, 2024, 04:07 PM UTC

    The issue has been isolated to the interactive service nodes. We are currently working to provision additional interactive service nodes to resolve the issue.

  4. identified Feb 19, 2024, 04:45 PM UTC

    The additional interactive service nodes have been added. We are still investigating the root cause and next steps. Additional information will be provided as soon as it is available.

  5. investigating Feb 19, 2024, 05:32 PM UTC

    Our engineering teams are focused on identifying the root cause of the incident and are dedicating all available resources to the investigation. We have also added 12 more interactive nodes to lessen the impact. We are working to resolve the issue and will provide updates as soon as we have more information.

  6. investigating Feb 19, 2024, 06:00 PM UTC

    Our engineering teams have not yet identified the root cause. We are continuing to investigate and will provide further updates as soon as possible.

  7. investigating Feb 19, 2024, 06:51 PM UTC

    Our engineering teams are focused on identifying the root cause of the incident and are dedicating all available resources to the investigation. We are working to resolve the issue and will provide updates as soon as we have more information.

  8. investigating Feb 19, 2024, 07:04 PM UTC

    We are restarting the Ambra instance for emergency maintenance at 2:10 PM ET (07:10 PM UTC). Ambra will be down for approximately 30 minutes. More information will be posted as soon as it is available.

  9. monitoring Feb 19, 2024, 07:38 PM UTC

    The Ambra emergency maintenance is complete. Ambra is back online, and users can log in and navigate. We are continuing to investigate and will provide further updates as soon as possible.

  10. monitoring Feb 19, 2024, 07:50 PM UTC

    We believe that the incident is resolved. Users can log in, and the user interface is back to normal performance. However, we have a services backlog that we are working through. Our team will conduct a root cause analysis and share the findings as soon as possible. We will continue to monitor the situation to ensure there are no further issues.

  11. resolved Feb 19, 2024, 11:01 PM UTC

    The incident has been fully resolved and service is back to normal levels. Our team will conduct a root cause analysis and share the findings as soon as possible. We will continue to monitor the situation to ensure there are no further issues.

  12. postmortem Mar 06, 2024, 05:32 PM UTC

    Due to increasing resource utilization, we increased the memory of our caching/queueing component during a planned maintenance window on February 18. During this maintenance we also changed the CPU class for increased consistency with our other systems, including our UAT environment. The modified system initially performed normally, but as platform traffic increased on February 19 we began experiencing increased operation latency, leading to degraded API performance. Modifying these components requires complete platform downtime to ensure consistent queue processing, so we first attempted to increase the number of front-end servers in order to reduce load on the backend systems. Performance improved temporarily but began degrading again as the platform reached its peak time of day. We then took an emergency platform outage to revert to the original CPU class, at which point performance returned to normal levels.