LogDNA incident

Degraded performance for WebUI, Ingestion, Alerting, Searching, Live Tail, Graphing, and Timelines

Minor · Resolved

LogDNA experienced a minor incident on October 5, 2022 affecting Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), and five more components, lasting 1h 7m. The incident has been resolved; the full update timeline is below.

Started
Oct 05, 2022, 02:58 PM UTC
Resolved
Oct 05, 2022, 04:05 PM UTC
Duration
1h 7m
Detected by Pingoru
Oct 05, 2022, 02:58 PM UTC

Affected components

Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Alerting, Livetail

Update timeline

  1. monitoring Oct 05, 2022, 02:58 PM UTC

    Service is restored but we are still monitoring.

  2. resolved Oct 05, 2022, 04:05 PM UTC

    This incident has been resolved. All services are fully operational.

  3. postmortem Oct 12, 2022, 06:52 PM UTC

    **Dates:**
    Start Time: Wednesday, October 5, 2022, at 14:27 UTC
    End Time: Wednesday, October 5, 2022, at 14:45 UTC
    Duration: 00:18

    **What happened:**
    The ingestion of logs was partially halted. The WebUI was mostly unresponsive and most API calls failed. Because many newly submitted logs were not being ingested, new logs were not immediately available for Alerting, Searching, Live Tail, Graphing, Timelines, and Archiving.

    **Why it happened:**
    We recently added a new API gateway, Kong, to our service; it acts as a proxy for all other services. We had gradually increased the amount of traffic directed through the API gateway over several weeks and seen no ill effects. Prior to the incident, only some of the ingestion traffic went through the gateway. Kong was restarted after a routine configuration change. After the restart, all traffic for our ingestion service began to go through Kong. Our monitoring quickly revealed that the Kong service did not have enough pods to keep up with the increased workload, causing many requests to fail.

    **How we fixed it:**
    We manually added more pods to the Kong service. Ingestion, the WebUI, and API calls began to work normally again. Once ingestion had resumed, LogDNA agents running in customer environments resent all locally cached logs to our service for ingestion. No data was lost.

    **What we are doing to prevent it from happening again:**
    We updated Kubernetes to always assign enough pods for the Kong API gateway service to handle all traffic. We'll update the Kong gateway to distribute ingestion traffic more evenly across available pods. We will adjust our deployment processes so pods are restarted more slowly, which will reduce the impact in a similar scenario. We'll also explore autoscaling policies so more pods can be added automatically in a similar situation (a sketch of such a policy follows below).
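As an illustration of the last two remediation items, Kubernetes can enforce a replica floor and add pods automatically under load with a HorizontalPodAutoscaler. The manifest below is a minimal sketch only: the names (`kong-proxy`, the `gateway` namespace) and the thresholds are hypothetical and do not reflect LogDNA's actual configuration.

```yaml
# Hypothetical HorizontalPodAutoscaler for an API gateway deployment.
# Not LogDNA's actual configuration; names and numbers are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kong-proxy
  namespace: gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kong-proxy        # the gateway Deployment being scaled
  minReplicas: 6            # floor sized to handle all ingestion traffic
  maxReplicas: 20           # headroom for traffic spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%
```

With a configuration along these lines, `minReplicas` guarantees a baseline capacity even immediately after a restart, while the CPU target lets the cluster add pods on its own instead of requiring the manual scaling described in the postmortem.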