Landbot experienced a major incident on June 20, 2022 affecting Builder, lasting 21m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jun 20, 2022, 02:48 PM UTC
There is an ongoing incident affecting the Landbot app. The bots are working correctly. We are investigating and working on the issue with maximum priority.
- investigating Jun 20, 2022, 02:48 PM UTC
We are continuing to investigate this issue.
- identified Jun 20, 2022, 02:49 PM UTC
The issue has been identified and a fix is being implemented.
- identified Jun 20, 2022, 02:50 PM UTC
We are continuing to work on a fix for this issue.
- monitoring Jun 20, 2022, 02:51 PM UTC
A fix has been implemented and we are monitoring results.
- resolved Jun 20, 2022, 03:10 PM UTC
This incident has been resolved.
- postmortem Jun 24, 2022, 03:04 PM UTC
# Incident summary

Between 16:23 and 16:46 CEST on the 20th of June, a cascading failure during a Redis migration to version 6 affected some deployments, resulting in a service outage.

# Impact

This incident affected some customers, who experienced 23 minutes of downtime in the application.

# Root Causes

During the Redis migration, a bug in the GitLab CI pipeline was identified. This bug truncates the trailing 0 of the container image tag, so the pipeline deployed release version `1.2` instead of `1.20`, causing a CrashLoopBackOff error in one of our Deployment containers. The environment variables for the Deployments were then pointed to the new Redis 6 instance while the Deployments were still in CrashLoopBackOff.

# Trigger

The ConfigMap of one of our Deployments was modified, changing the Redis server environment variable to point to the new Redis 6 instance. The restart of these Deployments applied the updated ConfigMaps while one of them was still in CrashLoopBackOff status, causing a cascading failure that affected other Deployments and triggered the incident.

# Detection

The Redis 6 queue started to fill up with tasks ahead of time, and the application stopped working.
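The postmortem does not say how the pipeline lost the trailing zero described under Root Causes. One common way this class of bug arises, shown here purely as an assumption and not as the confirmed cause, is a version string passing through a numeric type somewhere between the pipeline variable and the image reference. A minimal Python sketch of the effect follows; both helper names are hypothetical.

```python
def render_tag_lossy(version) -> str:
    """Hypothetical lossy path: the version is coerced to a float first.

    Any trailing zero in the fractional part is lost, because 1.20 and 1.2
    are the same number.
    """
    return str(float(version))


def render_tag_safe(version: str) -> str:
    """Treat the version as an opaque string and reject anything else."""
    if not isinstance(version, str):
        raise TypeError(
            f"image tag must be a string, got {type(version).__name__}"
        )
    return version


if __name__ == "__main__":
    print(render_tag_lossy("1.20"))  # prints 1.2
    print(render_tag_safe("1.20"))   # prints 1.20
```

Keeping release identifiers as strings end to end, and failing loudly when a number shows up instead, is one way to surface this kind of truncation before a deployment rather than after it.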
# Timeline

**2022-06-20 (all times are CEST)**

* **15:29** - Created new Redis 6 instance
* **15:44** - Created ConfigMap and Secret for auxiliary Deployments, containing environment variables pointing to the Redis 6 instance
* **15:59** - _INCIDENT BEGINS_ Created auxiliary Deployments, which were deployed with release version `1.2` instead of `1.20`
* **16:17** - Updated environment variables to point to the Redis 6 instance
* **16:19** - Redis 6 queue began to fill up progressively
* **16:23** - _OUTAGE BEGINS_ The application stopped working
* **16:39** - Deleted auxiliary Deployments
* **16:46** - Redis 6 queue reached a length of 309K
* **16:46** - _OUTAGE MITIGATED_ Environment variables restored to the previous Redis 5 instance
* **16:46** - _OUTAGE ENDS_ All services restored and working correctly
* **17:08** - Followed procedure to continue with the Redis 6 migration
* **17:10** - After running the GitLab CI pipeline to create auxiliary Deployments, container images were manually set to release `1.20`
* **17:11** - Redis 6 queue started to decrease
* **17:15** - _INCIDENT ENDS_ Redis 6 queue emptied
* **17:18** - Original Deployments moved to Redis 6 instance
* **17:31** - Deleted auxiliary Deployments
* **17:44** - Redis 6 migration completed

# Action Items as a result of the Postmortem

* Investigate GitLab CI bug
* Update Redis migration playbook
* Recompose internal metrics during the incident
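The trigger in this incident was switching the Redis environment variables while a Deployment was still crash-looping on the wrong image tag. As an illustration only, and not a description of the actual playbook, a pre-switch gate could refuse to proceed until every auxiliary Deployment runs the expected tag and is healthy. The `DeploymentStatus` type, the `blocking_reasons` function, and the Deployment names below are all hypothetical; how the status snapshot is collected from the cluster is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DeploymentStatus:
    """Hypothetical snapshot of one Deployment, gathered by existing tooling."""
    name: str
    image_tag: str         # tag actually running, e.g. "1.2"
    ready_replicas: int
    desired_replicas: int
    crash_looping: bool    # True if any container is in CrashLoopBackOff


def blocking_reasons(
    statuses: Iterable[DeploymentStatus], expected_tag: str
) -> List[str]:
    """Return every reason not to switch the Redis variables yet.

    An empty list means the gate passes.
    """
    reasons: List[str] = []
    for s in statuses:
        if s.image_tag != expected_tag:
            reasons.append(
                f"{s.name}: running tag {s.image_tag!r}, expected {expected_tag!r}"
            )
        if s.crash_looping:
            reasons.append(f"{s.name}: container in CrashLoopBackOff")
        if s.ready_replicas < s.desired_replicas:
            reasons.append(
                f"{s.name}: {s.ready_replicas}/{s.desired_replicas} replicas ready"
            )
    return reasons


if __name__ == "__main__":
    # Made-up Deployment names and values, loosely mirroring the 15:59 state.
    snapshot = [
        DeploymentStatus("aux-worker", "1.2", 0, 2, True),
        DeploymentStatus("aux-scheduler", "1.20", 1, 1, False),
    ]
    for reason in blocking_reasons(snapshot, expected_tag="1.20"):
        print("BLOCKED:", reason)
```

Comparing the running tag as a string against the intended release is what would catch a `1.2` versus `1.20` mismatch; the crash-loop and readiness checks cover the case where the tag is right but the rollout has not settled.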