iAdvize incident

P1 - Disturbances on the Conversations Panel and Administration App

Major Resolved

iAdvize experienced a major incident on November 20, 2024, affecting Login and other components (listed below) and lasting 9m. The incident has been resolved; the full update timeline is below.

Started
Nov 20, 2024, 05:12 PM UTC
Resolved
Nov 20, 2024, 05:21 PM UTC
Duration
9m
Detected by Pingoru
Nov 20, 2024, 05:12 PM UTC

Affected components

Login, Chat, Engagement Notification, Rest API, Copilot Shopper, Engagement settings, Conversation views, Call, Facebook

Update timeline

  1. monitoring Nov 20, 2024, 05:12 PM UTC

    From 5:19 PM CET to 5:52 PM CET, we encountered disturbances across the entire platform. You may have noticed blank pages on the conversation panel or in the administration app, as well as difficulties closing conversations. The cause has been identified, and the technical team has restored service. We are continuing to monitor the platform.

  2. resolved Nov 20, 2024, 05:21 PM UTC

    This incident has been resolved. More information on this status will be provided in the coming days. Thank you for your patience.

  3. postmortem Nov 25, 2024, 01:45 PM UTC

    **Incident:** On November 20th (17:19 to 17:52 CET) and November 21st (9:30 to 9:43 CET), we experienced two incidents that degraded the user experience on the Conversation panel and the Administration application. During these periods, agents' conversation processing was disrupted by white screens or error messages. Managers' monitoring of stats reports was also affected by error messages.

    These disturbances resulted from changes made to the platform infrastructure as part of our regular, scheduled system maintenance. Although these planned actions were initially assessed as not a major risk and were validated in a pre-production environment, they had an unexpected impact on platform stability. Access to services critical to the proper operation of the platform was temporarily cut.

    **Resolution**

    To resolve the issue, our technical team had to manually change some settings on these critical services and then restart them. Bringing the underlying services back to their nominal state allowed the Conversation panel and the Administration application to return to their own nominal state.

    **Actions for the future**

    * (Done) Review our internal processes to ensure that customer communication on our [status page](https://status.iadvize.com/) is more responsive
    * (Done) Review our maintenance process to better identify and scope potential negative impacts on the iAdvize platform, and adapt our execution plans accordingly
    * (Done) Improve probes and alerting on failing services so that we react faster

    **Focus on the Black Friday period**

    Looking ahead to the next critical period, we are confident that we will handle incoming traffic on the iAdvize platform without disruption. This incident was the consequence of manual actions whose impact was not adequately anticipated. It was not a problem related to traffic management or platform scaling.

    In the meantime, we have been proactive in getting the iAdvize platform ready, and we have reviewed our teams' preparation for this high-traffic period. Our approach is based on three pillars, which have already been identified and implemented:

    * Freeze period: no new code is deployed to production
    * Stress test: the platform's scalability is tested under heavy, peak-level traffic
    * Team mobilization: the right people are assigned to monitor the main components 24/7

    Be assured that our team and platform are ready for the end of the year.