Mezmo incident

Ingestion, Searching, Live Tail, Alerting, Graphing, and Timelines Delays

Critical · Resolved

Mezmo experienced a critical incident on January 26, 2022, affecting Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), and several other components, lasting 1h 5m. The incident has been resolved; the full update timeline is below.

Started
Jan 26, 2022, 04:10 PM UTC
Resolved
Jan 26, 2022, 05:15 PM UTC
Duration
1h 5m
Detected by Pingoru
Jan 26, 2022, 04:10 PM UTC

Affected components

Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Alerting, Livetail

Update timeline

  1. investigating Jan 26, 2022, 04:10 PM UTC

    Ingestion services are currently halted. Customers will also experience delays with Searching, Live Tail, Alerting, Graphing, and Timelines.

  2. investigating Jan 26, 2022, 04:23 PM UTC

    We are continuing to investigate this issue.

  3. monitoring Jan 26, 2022, 04:58 PM UTC

    We have implemented a fix and are monitoring the results. Logs are being ingested again and all services are operational.

  4. resolved Jan 26, 2022, 05:15 PM UTC

    This incident has been resolved. All services are operational.

  5. postmortem Jan 31, 2022, 08:14 PM UTC

    **Dates:** Start Time: Wednesday, January 26, 2022, at 15:45:00 UTC. End Time: Wednesday, January 26, 2022, at 16:30:00 UTC. Duration: 00:45:00.

    **What happened:** Ingestion was halted and newly submitted logs were not immediately available for Alerting, Live Tail, Searching, Graphing, and Timelines. Some alerts were never triggered. Once ingestion had resumed, LogDNA agents running in customer environments resent all locally cached logs to our service for ingestion. No data was lost.

    **Why it happened:** Our Redis database failed over, and the services that depend on it were unable to recover automatically. Normally, the pods running our ingestion service deliberately crash until they are able to access Redis again. However, these pods were in a bad state and unable to reconnect when Redis returned. Since ingestion was halted, newly submitted logs were not passed on to many downstream services, such as Alerting, Live Tail, Searching, Graphing, and Timelines.

    **How we fixed it:** We manually restarted all the pods of our ingestion service, then restarted all the sentinel pods of Redis. The ingestion service became operational again and logs were passed on to all downstream services. Over a short period, these services processed the backlog of logs and newly submitted logs were again available without delays.

    **What we are doing to prevent it from happening again:** The ingestion pods were in a bad state because they had not been restarted after a configuration change made several days earlier, for reasons unrelated to this incident. The runbook for making such configuration changes has been updated to prevent this procedural failure in the future. We are also in the middle of a project to make all services more tolerant of Redis failovers.
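The fail-fast pattern the postmortem describes, where an ingestion pod deliberately crashes until Redis is reachable so the orchestrator replaces it with a fresh process, can be sketched as follows. This is a minimal illustration under stated assumptions, not Mezmo's actual code: the `ping` callable, retry counts, and delays are all hypothetical stand-ins.

```python
import sys
import time


def wait_for_redis(ping, attempts=5, delay=0.5):
    """Fail-fast startup check: retry `ping` a bounded number of times.

    `ping` is any zero-argument callable that returns True when Redis
    answers (e.g. a redis client's ping method); it is a placeholder
    here. Returning False signals the caller to exit so the
    orchestrator (e.g. Kubernetes) restarts the pod, rather than
    leaving a wedged process running.
    """
    for _ in range(attempts):
        try:
            if ping():
                return True  # Redis reachable; proceed with ingestion
        except ConnectionError:
            pass  # treat a refused connection like a failed ping
        time.sleep(delay)
    return False  # never connected; caller should crash on purpose


if __name__ == "__main__":
    # Hypothetical wiring: a real service would pass its Redis
    # client's ping method instead of this stub.
    if not wait_for_redis(lambda: True):
        sys.exit(1)  # deliberate crash -> orchestrator restarts the pod
```

The design point is that exiting non-zero is the recovery mechanism: a pod that cannot reach Redis is worth less than a freshly scheduled one, which is why the stuck pods that *didn't* crash had to be restarted by hand during this incident.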