LogDNA incident

Web UI unavailable and ingestion has stopped

Critical, Resolved

LogDNA experienced a critical incident on January 14, 2021 affecting Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), and Web App, lasting 21m. The incident has been resolved; the full update timeline is below.

Started
Jan 14, 2021, 08:05 PM UTC
Resolved
Jan 14, 2021, 08:27 PM UTC
Duration
21m
Detected by Pingoru
Jan 14, 2021, 08:05 PM UTC

Affected components

Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App

Update timeline

  1. investigating Jan 14, 2021, 08:05 PM UTC

    The web UI is unavailable and ingestion has stopped. We are investigating.

  2. resolved Jan 14, 2021, 08:27 PM UTC

    The web UI is available again and ingestion has resumed. All services are operational.

  3. postmortem Jan 19, 2021, 09:57 PM UTC

    **Dates:**
    Start Time: Thursday, January 14, 2021, at 19:42 UTC
    End Time: Thursday, January 14, 2021, at 20:27 UTC
    Duration: 0:45:00

    **What happened:**
    Our WebUI became unavailable and ingestion of new logs stopped for 45 minutes. Logs were automatically resent later and ingested successfully for customers using our ingestion client agent.

    **Why it happened:**
    The certificate used by all our services expired. Consequently, all API calls to our service failed, which caused our WebUI to fail and ingestion of new logs to stop.

    **How we fixed it:**
    We renewed the certificate and applied it to all affected services. Our WebUI became responsive again and ingestion resumed. Since no logs had been ingested for about 45 minutes, our service had a moderately large backlog to process. As it caught up, users experienced delays in searching, graphing, and timelines for newly submitted logs.

    **What we are doing to prevent it from happening again:**
    We’re tightening our internal notifications of upcoming expiration dates for all certificates our service relies upon.
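
The postmortem does not describe how the expiration notifications are implemented. As a rough illustration only, an expiry check of this kind could be sketched in Python with the standard library. The function names (`parse_not_after`, `expires_within`, `fetch_not_after`) and the 30-day warning window are hypothetical and are not taken from LogDNA's tooling.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate 'notAfter' string (e.g. 'Jun 15 00:00:00 2030 GMT')
    into a timezone-aware UTC datetime."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def expires_within(not_after: datetime, now: datetime, threshold: timedelta) -> bool:
    """Return True if the certificate expires within `threshold` of `now`
    (this also covers certificates that have already expired)."""
    return not_after - now <= threshold


def fetch_not_after(host: str, port: int = 443) -> datetime:
    """Read the expiry date of the certificate served by host:port.

    Assumption: the certificate is still valid. With default verification,
    the TLS handshake fails for an already-expired certificate, so this
    helper is only useful for warning ahead of time.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return parse_not_after(tls.getpeercert()["notAfter"])


if __name__ == "__main__":
    # Illustrative values only; the warning window is an assumed policy.
    expiry = parse_not_after("Jun 15 00:00:00 2030 GMT")
    now = datetime(2030, 6, 1, tzinfo=timezone.utc)
    if expires_within(expiry, now, timedelta(days=30)):
        print(f"Certificate expires on {expiry:%Y-%m-%d}; renew soon.")
```

A scheduled job could call `fetch_not_after` for each endpoint and alert whenever `expires_within` returns true, so a renewal is prompted well before the certificate lapses.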