LogDNA incident

Logs are not searchable in web app

Major · Resolved

LogDNA experienced a major incident on March 4, 2021 affecting Web App, Search, and Livetail, lasting 21m. The incident has been resolved; the full update timeline is below.

Started
Mar 04, 2021, 08:00 AM UTC
Resolved
Mar 04, 2021, 08:21 AM UTC
Duration
21m
Detected by Pingoru
Mar 04, 2021, 08:00 AM UTC

Affected components

Web App, Search, Livetail

Update timeline

  1. investigating Mar 04, 2021, 08:00 AM UTC

    We are currently investigating an issue that is rendering our log viewer unavailable.

  2. monitoring Mar 04, 2021, 08:12 AM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Mar 04, 2021, 08:21 AM UTC

    This incident has been resolved and logs are searchable in the web app. We'll continue to monitor all services.

  4. postmortem Mar 25, 2021, 05:30 PM UTC

    **Dates:**
    Start Time: Thursday, March 4, 2021, at ~03:45 UTC
    End Time: Thursday, March 4, 2021, at ~08:20 UTC
    Duration: ~4h 36m

    **What happened:**
    Our Web UI returned the error message "Request returned an error. Try again?" when users tried to run a search query or use Live Tail.

    **Why it happened:**
    The pods that run our search and Live Tail services were automatically terminated by our Kubernetes orchestration system. Upon investigation, we discovered we had inadvertently classified these services as low priority. The incident occurred when a large number of higher-priority services needed to run to meet usage demand, and the orchestration system automatically terminated the lower-priority services to free resources for them. More specifically, these pods were put into a "Terminating" state. Normally this state is temporary, a transition between "Running" and "Terminated". During this incident, the pods remained in the "Terminating" state indefinitely. Our monitoring detects services that have been "Terminated", but not ones in the transient "Terminating" state, so our infrastructure team was not notified.

    **How we fixed it:**
    We increased the priority of the pods that run our search and Live Tail services to match the priority of other services, and we updated the configuration of our orchestration system to make the change permanent.

    **What we are doing to prevent it from happening again:**
    We have already updated the configuration of our orchestration system to give services the correct priority. These changes are permanent and should prevent similar problems in the future.
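For illustration, the fix described above amounts to raising the pods' Kubernetes priority. Below is a minimal sketch of creating a higher PriorityClass with client-go; the class name `core-service-priority`, its value, and the kubeconfig location are assumptions for the example, not LogDNA's actual configuration.

```go
package main

import (
	"context"
	"fmt"

	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Hypothetical PriorityClass whose value matches the cluster's other
	// core services, so the scheduler no longer preempts pods that use it.
	pc := &schedulingv1.PriorityClass{
		ObjectMeta:    metav1.ObjectMeta{Name: "core-service-priority"},
		Value:         1000000, // illustrative; must match the other services' priority
		GlobalDefault: false,
		Description:   "Priority for search and Live Tail pods (example only).",
	}
	created, err := clientset.SchedulingV1().PriorityClasses().Create(
		context.TODO(), pc, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created PriorityClass:", created.Name)
}
```

Workloads opt in by setting `spec.priorityClassName` on their pod templates; once the search and Live Tail pods carry the same priority as everything else, the scheduler has no reason to evict them first under resource pressure.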
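The monitoring gap described in the postmortem (pods stuck in the transient "Terminating" state) is also detectable: a terminating pod is one that still exists but has its deletion timestamp set. A minimal client-go sketch of such a check, assuming a hypothetical 10-minute alert threshold:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const stuckThreshold = 10 * time.Minute // assumption: alert after 10 minutes

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A pod with a DeletionTimestamp that is still listed by the API server
	// is in the "Terminating" state.
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp == nil {
			continue // running normally, or already fully terminated
		}
		stuckFor := time.Since(pod.DeletionTimestamp.Time)
		if stuckFor > stuckThreshold {
			fmt.Printf("ALERT: pod %s/%s terminating for %s\n",
				pod.Namespace, pod.Name, stuckFor.Round(time.Second))
		}
	}
}
```

Run periodically, a check like this closes the gap between "Terminated", which the original monitoring caught, and "Terminating", which it did not.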