iAdvize incident
P1. Bots are no longer operational (they no longer respond)
iAdvize experienced a critical incident on November 29, 2023 affecting Bot service (except IA features), lasting 3h 34m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Nov 29, 2023, 04:19 PM UTC
We have noticed that the bots are no longer responding. The technical team is working to resolve the problem. We're going to restart the service to restore it.
- monitoring Nov 29, 2023, 04:25 PM UTC
The service has restarted and we are seeing a return to normal.
- investigating Nov 29, 2023, 04:31 PM UTC
We are seeing new disruptions appear. Bots are taking a long time to respond or, in some cases, are no longer responding. We are actively working to resolve the problem.
- monitoring Nov 29, 2023, 04:43 PM UTC
We have seen a return to normality in the last 5 minutes following our latest actions. We are continuing to monitor the situation.
- monitoring Nov 29, 2023, 04:57 PM UTC
Some conversations that started during the incident are still in progress, but the bot is no longer responding in them. We are going to close these conversations manually.
- resolved Nov 29, 2023, 07:53 PM UTC
This incident has been resolved. All conversations stuck during the incident have been manually closed. Thank you for your patience.
- postmortem Dec 01, 2023, 03:52 PM UTC
## **Incident**

We had a lag issue on our Bots backend services that prevented Bots managed by iAdvize from functioning on your websites. Conversations handled exclusively by humans were still functional. However, if a bot intervened in the engagement flow used by visitors, conversations stopped at the first stage of the bot scenario.

This lag issue occurred following the release of a version containing the first building blocks of a feature that will soon be available. This release successfully passed all our validation protocols. However, the increase in load that followed the release generated a significant lag in the incoming conversation ingestion service. As a consequence, these incoming conversations exceeded the maximum execution time and were discarded from processing.

This issue happened twice on November 29th:

- 16:45 to 17:22 CET
- 17:27 to 17:37 CET

## **Resolution**

To mitigate the issue and restore the Bots services, we performed the following actions:

* Manually cleaned up the event overflow in the incoming conversation ingestion service
* Rolled back the release that introduced the lag

## **Actions for the future**

* \(Done\) Add parallel processing to the event consumers so that they can react quickly in case of lag on bots
* \(Done\) Put a limit on the event publisher to prevent possible lag on bots
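
The two completed actions correspond to a common pattern: several consumers draining one queue in parallel, and a bounded queue that slows the publisher down when consumers fall behind. The sketch below illustrates that pattern only. It is not iAdvize's implementation, and every name in it (`publish`, `consume`, `run_pipeline`, `workers`, `max_pending`) is hypothetical, as is the use of an in-process `asyncio.Queue` in place of a real event broker.

```python
import asyncio


async def publish(queue, events):
    # queue.put() waits while the queue is full, so the publisher can
    # never be more than `max_pending` events ahead of the consumers.
    # This stands in for the "limit on the event publisher".
    for event in events:
        await queue.put(event)


async def consume(queue, handled):
    # Each worker pulls events independently of the others.
    # This stands in for "parallel processing on the event consumers".
    while True:
        event = await queue.get()
        await asyncio.sleep(0.01)  # placeholder for real processing work
        handled.append(event)
        queue.task_done()


async def run_pipeline(events, workers=4, max_pending=10):
    queue = asyncio.Queue(maxsize=max_pending)  # bounded buffer
    handled = []
    tasks = [
        asyncio.create_task(consume(queue, handled)) for _ in range(workers)
    ]
    await publish(queue, events)
    await queue.join()  # wait until every published event is processed
    for task in tasks:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return handled


if __name__ == "__main__":
    result = asyncio.run(run_pipeline(list(range(50))))
    print(f"processed {len(result)} events")
```

With a bounded queue, a surge in incoming events shows up as a slower publisher instead of an ever-growing backlog, which is the failure mode where events wait past their maximum execution time and get discarded.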