Mezmo incident
Delays in Alerting, Searching, Live Tail, Graphing, and Timelines. WebUI intermittently unavailable.
Mezmo experienced a notice-level incident on February 8, 2022 affecting the Web App, Search, and one other component, lasting 1h 2m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Feb 08, 2022, 01:38 PM UTC
We are currently investigating the issue.
- investigating Feb 08, 2022, 01:55 PM UTC
We are continuing to investigate the issue. The web app may be intermittently inaccessible. Logs may arrive with a delay, which impacts searching and alerting.
- monitoring Feb 08, 2022, 02:25 PM UTC
We have implemented a fix and are monitoring the results. Newly sent logs are being processed again with minimal delays, and all services are operational.
- resolved Feb 08, 2022, 02:40 PM UTC
This incident has been resolved.
- postmortem Feb 15, 2022, 07:23 PM UTC
**Dates:**
Start Time: Tuesday, February 8, 2022, at 13:17 UTC
End Time: Tuesday, February 8, 2022, at 14:21 UTC
Duration: 1:04:00

**What happened:**
Our Web UI was unresponsive for about 10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines. No data was lost, and ingestion was not halted.

**Why it happened:**
Our Redis database had a failover, and the services that depend on it, including the Parser, were unable to reconnect after it recovered. The Parser is upstream of many other services, so newly submitted logs were not passed on to downstream services such as Alerting, Live Tail, Searching, Graphing, and Timelines. The Web UI was also intermittently unavailable because it requires a connection to Redis.

**How we fixed it:**
We manually restarted the Redis service, which allowed a new master to be elected. After Redis recovered, the Parser, Web UI, and other services were restarted and were then able to reestablish their connections to Redis. This restored the Web UI and allowed newly submitted logs to pass from our Parser service to all downstream services. Over a short period of time, these services processed the backlog of logs, and newly submitted logs were again available without delays.

**What we are doing to prevent it from happening again:**
We recently added functionality to track the flow rate of newly submitted logs. This new feature requires more memory than expected in the event of a Redis failover, which is why services could not reconnect to Redis. We have increased the memory buffer limits for the relevant portions of our service. We will also add Redis monitoring to more quickly detect unhealthy sentinels, and we will continue an ongoing project to make all services more tolerant of Redis failovers.
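The core failure mode above (services unable to reconnect after a Redis failover until they were manually restarted) is commonly mitigated by retrying the connection with exponential backoff rather than giving up on the first error. The sketch below is a generic illustration of that pattern, not Mezmo's implementation; `reconnect_with_backoff` and `flaky_connect` are hypothetical names, and the stand-in connector simulates a master that is mid-election.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def reconnect_with_backoff(connect: Callable[[], T],
                           max_attempts: int = 5,
                           base_delay: float = 0.5) -> T:
    """Retry `connect` with exponential backoff until it succeeds.

    `connect` is any zero-argument callable that raises ConnectionError
    on failure, e.g. one that opens a fresh Redis connection after a
    failover. Re-raises the last error once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")

# Stand-in connector: fails twice before succeeding, simulating a
# Redis master that has not yet been elected.
attempts = {"n": 0}

def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("master not yet elected")
    return "connected"

print(reconnect_with_backoff(flaky_connect, base_delay=0.01))  # prints "connected"
```

A real deployment would also cap the total retry window and surface an alert once the budget is exhausted, so a stuck dependency is detected quickly rather than masked by endless retries.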