Landbot incident

Downtime during component migration

Landbot experienced a major incident on June 20, 2022 affecting Builder, lasting 21m. The incident has been resolved; the full update timeline is below.

Started: Jun 20, 2022, 02:48 PM UTC
Resolved: Jun 20, 2022, 03:10 PM UTC
Duration: 21m
Detected by Pingoru: Jun 20, 2022, 02:48 PM UTC

Affected components

Builder

Update timeline

investigating Jun 20, 2022, 02:48 PM UTC

There is an ongoing incident affecting the Landbot app. The bots are working correctly. We are investigating and working on the issue with maximum priority.

investigating Jun 20, 2022, 02:48 PM UTC

We are continuing to investigate this issue.

identified Jun 20, 2022, 02:49 PM UTC

The issue has been identified and a fix is being implemented.

identified Jun 20, 2022, 02:50 PM UTC

We are continuing to work on a fix for this issue.

monitoring Jun 20, 2022, 02:51 PM UTC

A fix has been implemented and we are monitoring results.

resolved Jun 20, 2022, 03:10 PM UTC

This incident has been resolved.

postmortem Jun 24, 2022, 03:04 PM UTC

# Incident summary

Between 16:23 and 16:46 CEST on June 20, 2022, a cascading failure during a migration to Redis 6 affected some deployments and caused a service outage.

# Impact

Some customers were unable to use the application for 23 minutes.

# Root Causes

During the Redis migration, we identified a bug in the GitLab CI pipeline that truncates the trailing `0` of the container image tag. As a result, the pipeline deployed release version `1.2` instead of `1.20`, which caused a CrashLoopBackOff error in one of our Deployment containers. The environment variables for the Deployments were then pointed at the new Redis 6 instance while the Deployments were still in CrashLoopBackOff.

# Trigger

The ConfigMap of one of our Deployments was modified to change the Redis server environment variable to the new Redis 6 instance. Restarting these Deployments applied the updated ConfigMaps while one of them was still in CrashLoopBackOff status, causing a cascading failure that affected other Deployments and triggered the incident.

# Detection

The Redis 6 queue started to fill up with tasks ahead of time, and the application stopped working.
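The postmortem does not say where in the pipeline the trailing `0` of the image tag is lost. One common way a tag such as `1.20` turns into `1.2` is that the value is parsed as a number at some point and then converted back to text. The Python sketch below illustrates that general mechanism only. The function names are hypothetical, and this is not a reproduction of the actual GitLab CI bug.

```python
import json


def tag_via_number(raw: str) -> str:
    """Hypothetical lossy path: parse the tag as a JSON number, then stringify it."""
    return str(json.loads(raw))


def tag_as_string(raw: str) -> str:
    """Hypothetical safe path: the tag is never interpreted as a number."""
    return raw


def tag_was_truncated(intended: str, deployed: str) -> bool:
    """True when the deployed tag equals the intended tag minus one or more trailing zeros."""
    dropped = intended[len(deployed):]
    return intended.startswith(deployed) and dropped != "" and set(dropped) == {"0"}


if __name__ == "__main__":
    intended = "1.20"
    deployed = tag_via_number(intended)
    print(deployed)                               # 1.2
    print(tag_as_string(intended))                # 1.20
    print(tag_was_truncated(intended, deployed))  # True
```

A comparison like `tag_was_truncated` could run as a pipeline step that fails the job before an unintended release is deployed. Treating version tags as quoted strings from end to end avoids the numeric round trip altogether.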
# Timeline

**2022-06-20 (all times are CEST)**

* **15:29** - Created new Redis 6 instance
* **15:44** - Created ConfigMap and Secret for auxiliary Deployments, containing environment variables pointing to Redis 6 instance
* **15:59** - _INCIDENT BEGINS_ Created auxiliary Deployments, which were deployed with release version `1.2` instead of `1.20`
* **16:17** - Updated environment variables, pointing to Redis 6 instance
* **16:19** - Redis 6 queue began to fill up progressively
* **16:23** - _OUTAGE BEGINS_ The application stopped working
* **16:39** - Deleted auxiliary Deployments
* **16:46** - Redis 6 queue reached 309K length
* **16:46** - _OUTAGE MITIGATED_ Environment variables restored to previous Redis 5 instance
* **16:46** - _OUTAGE ENDS_ All services restored and working correctly
* **17:08** - Followed procedure to continue with the Redis 6 migration
* **17:10** - After running GitLab CI pipeline to create auxiliary Deployments, container images were manually set to release `1.20`
* **17:11** - Redis 6 queue started to decrease
* **17:15** - _INCIDENT ENDS_ Redis 6 queue emptied
* **17:18** - Original Deployments moved to Redis 6 instance
* **17:31** - Deleted auxiliary Deployments
* **17:44** - Redis 6 migration completed

# Action Items as a result of the Postmortem

* Investigate GitLab CI bug
* Update Redis migration playbook
* Recompose internal metrics during the incident
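The postmortem does not describe what the updated migration playbook will contain. As one possible guard, the Python sketch below shows a pre-flight check that refuses to switch Redis endpoints while any Deployment is in CrashLoopBackOff or is running an unexpected image tag. It is an assumption about what such a check could look like, not the team's actual procedure. The `DeploymentStatus` type, its fields, and the sample names are hypothetical, and in practice the data would be read from the cluster rather than written as literals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeploymentStatus:
    """Hypothetical snapshot of one Deployment, gathered before the switch."""
    name: str
    expected_tag: str
    running_tag: str
    waiting_reason: str = ""  # e.g. "CrashLoopBackOff"; empty when the container is running


def preflight_problems(deployments: List[DeploymentStatus]) -> List[str]:
    """Return a human-readable list of reasons not to proceed (empty means OK)."""
    problems = []
    for d in deployments:
        if d.running_tag != d.expected_tag:
            problems.append(
                f"{d.name}: running tag {d.running_tag!r}, expected {d.expected_tag!r}"
            )
        if d.waiting_reason == "CrashLoopBackOff":
            problems.append(f"{d.name}: container is in CrashLoopBackOff")
    return problems


def safe_to_switch(deployments: List[DeploymentStatus]) -> bool:
    """True only when no Deployment reports a problem."""
    return not preflight_problems(deployments)


if __name__ == "__main__":
    # Sample data mirroring the situation in the timeline; names are made up.
    snapshot = [
        DeploymentStatus("aux-worker", "1.20", "1.2", "CrashLoopBackOff"),
        DeploymentStatus("aux-scheduler", "1.20", "1.20"),
    ]
    for problem in preflight_problems(snapshot):
        print(problem)
    print("safe to switch:", safe_to_switch(snapshot))
```

With the sample snapshot the check reports two problems for the first Deployment and returns `False`, so the environment variables would stay on the previous Redis instance until the Deployments are healthy.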