iAdvize incident

P1 - Conversation notifications might not be visible on websites

iAdvize experienced a major incident on September 2, 2024 affecting the Chat, Call, and Video components, lasting 1h 55m. The incident has been resolved; the full update timeline is below.

Started: Sep 02, 2024, 10:06 AM UTC
Resolved: Sep 02, 2024, 12:02 PM UTC
Duration: 1h 55m
Detected by Pingoru: Sep 02, 2024, 10:06 AM UTC

Affected components

Chat, Call, Video

Update timeline

investigating Sep 02, 2024, 10:06 AM UTC

Please know that we are facing an issue: The notification might not appear on your Website. We are working on it.
investigating Sep 02, 2024, 10:08 AM UTC

We are continuing to investigate this issue.
identified Sep 02, 2024, 10:26 AM UTC

We are still on it. We are performing actions to resolve the issue.
monitoring Sep 02, 2024, 11:07 AM UTC

Please know that a fix is live. You should be able to see notifications on your Website again. We are monitoring this.
resolved Sep 02, 2024, 12:02 PM UTC

This incident has been resolved.
postmortem Sep 06, 2024, 12:31 PM UTC

**Incident:** On September 2nd, between 11:01 CEST and 13:04 CEST, we experienced an incident impacting the service in charge of iAdvize engagement (handling targeting).

During this timeframe, the display of notifications on our customers' websites and in mobile applications fluctuated between working intermittently and not being displayed at all. As a result, starting a conversation from the Chat / Call / Video / mobile application channels was degraded (86 min) or even completely cut off (37 min). Social channels were not impacted.

This unavailability of our engagement service occurred because:

* After a restart, our mirroring service moved to the same server instance as our engagement service
* Due to an unexpected resource usage spike on the mirroring service, the engagement service was left with insufficient resources to scale and run properly

**Resolution:** To solve this issue, we manually isolated our mirroring service on a different server instance, ensuring the engagement service had enough resources to run properly again.

**Actions for the future:**

* (Done) Isolate our mirroring service away from other critical services
* (Done) Analyze the causes of the resource increase on our mirroring service, and implement optimizations to reduce its resource usage
* (Done) Improve alerting in case of network resource issues on server instances