LogDNA incident

Ingestion of new logs (for Syslog only) - Partial Outage

LogDNA experienced a major incident on February 24, 2022 affecting Log Ingestion (Syslog), lasting 57m. The incident has been resolved; the full update timeline is below.

Started: Feb 24, 2022, 10:45 PM UTC
Resolved: Feb 24, 2022, 11:43 PM UTC
Duration: 57m
Detected by Pingoru: Feb 24, 2022, 10:45 PM UTC

Affected components

Log Ingestion (Syslog)

Update timeline

identified Feb 24, 2022, 10:45 PM UTC

New logs (from Syslog only) are intermittently not being ingested by our service. We are working to restore this functionality as soon as possible.

resolved Feb 25, 2022, 12:16 AM UTC

This incident has been resolved. Please reach out to us at [email protected] with any additional questions.

postmortem Mar 01, 2022, 08:28 PM UTC

**Dates:**
Start Time: Friday, February 18, 2022, at 00:10 UTC
End Time: Thursday, February 24, 2022, at 23:43 UTC
Duration: 167:33:00

**What happened:** The ingestion of new logs to our Syslog endpoint was intermittently failing.

**Why it happened:** We recently introduced a new service (Syslog Forwarder) to handle the ingestion of logs sent over Syslog. As the name implies, it forwards logs to downstream services. It was designed to send all logs submitted for each account to a single port opened on the downstream services. Our original design did not implement load balancing, and it performed well in our advance testing. Once it was in production, however, it became apparent that some customer accounts submit logs at a higher volume than the downstream services could process. When this happened, the Syslog Forwarder buffered log lines in memory, and memory usage grew until the pods crashed. Any log lines held on those pods were lost and never ingested.

**How we fixed it:** We improved the design of the Syslog Forwarder by adding a pool of connections to the downstream services. In effect, we added traffic shaping to the Syslog Forwarder.

**What we are doing to prevent it from happening again:** The new architecture has been deployed and has proven resilient in production. No further work is needed to prevent this kind of incident from happening again.
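
The postmortem does not show how the connection pool was implemented, so the following is only a minimal Python sketch of the general idea, not LogDNA's code. The class `PooledForwarder`, its methods, and its parameters are hypothetical names invented for illustration. The per-connection buffer limit is also an assumption: the postmortem mentions only the pool, while the sketch adds a bound to show how a forwarder can refuse new lines rather than buffer them in memory without limit.

```python
from collections import deque
from typing import List


class PooledForwarder:
    """Hypothetical sketch: spread one account's log lines across several
    downstream connections, each with a bounded in-memory buffer."""

    def __init__(self, pool_size: int, max_buffered_per_conn: int) -> None:
        if pool_size < 1 or max_buffered_per_conn < 1:
            raise ValueError("pool_size and max_buffered_per_conn must be >= 1")
        # Each deque stands in for the send buffer of one downstream connection.
        self.buffers = [deque() for _ in range(pool_size)]
        self.max_buffered_per_conn = max_buffered_per_conn
        self._next = 0  # index of the connection to try first

    def submit(self, line: str) -> bool:
        """Queue a line on the next connection that has room (round-robin).

        Returns False when every buffer is full, so the caller can apply
        backpressure instead of letting memory grow without bound.
        """
        n = len(self.buffers)
        for offset in range(n):
            idx = (self._next + offset) % n
            if len(self.buffers[idx]) < self.max_buffered_per_conn:
                self.buffers[idx].append(line)
                self._next = (idx + 1) % n
                return True
        return False

    def drain(self, conn_index: int, count: int) -> List[str]:
        """Simulate a downstream service consuming up to `count` lines."""
        buf = self.buffers[conn_index]
        return [buf.popleft() for _ in range(min(count, len(buf)))]
```

In this sketch a high-volume account is no longer tied to a single downstream port: its lines rotate across the pool, and once every buffer is full `submit` returns `False`, leaving it to the caller to retry, slow the sender, or drop the line deliberately.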