LogDNA incident

Ingestion of new logs (Syslog only) is intermittently failing

LogDNA experienced a major incident on February 18, 2022 affecting Log Ingestion (Syslog), lasting 33m. The incident has been resolved; the full update timeline is below.

Started: Feb 18, 2022, 01:41 AM UTC
Resolved: Feb 18, 2022, 02:15 AM UTC
Duration: 33m
Detected by Pingoru: Feb 18, 2022, 01:41 AM UTC

Affected components

Log Ingestion (Syslog)

Update timeline

investigating Feb 18, 2022, 01:41 AM UTC

Ingestion of new logs to our Syslog endpoint is intermittently failing. We are investigating.

resolved Feb 18, 2022, 02:15 AM UTC

This incident has been resolved. If your team is still unable to send logs via syslog, please let us know at [email protected]
postmortem Mar 01, 2022, 08:21 PM UTC

**Dates:**
Start Time: Thursday, February 17, 2022, at 20:56 UTC
End Time: Friday, February 18, 2022, at 02:15 UTC
Duration: 5:19:00

**What happened:** The ingestion of new logs to our Syslog endpoint was intermittently failing.

**Why it happened:** We made a code change to the part of our service (Syslog Forwarder) that handles the ingestion of logs sent over Syslog, and inadvertently changed how memory is managed. Routine garbage collection stopped, and memory usage grew on the pods that accept newly submitted log lines over Syslog. Eventually, the memory growth caused the pods to crash. Any log lines held on those pods were lost and never ingested.

**How we fixed it:** We reverted to the previous version of the Syslog Forwarder service, which stopped the pods from crashing. We then resolved the memory management issue in our code. The fixed version was released to production shortly thereafter and performed as expected.

**What we are doing to prevent it from happening again:** We have added regression tests to the Syslog Forwarder service to prevent a similar mistake in the future.
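
The postmortem does not describe what the new regression tests look like. As a rough illustration only, a memory regression test for a buffering service could resemble the Python sketch below. Everything in it is an assumption made for the example: `SyslogBuffer`, `retained_bytes_after`, the sample log line, and the 1 MB threshold are hypothetical and are not taken from the Syslog Forwarder code. The idea is to push a large number of lines through the component, flush it, and assert that the memory still retained afterward stays under a fixed bound.

```python
import gc
import tracemalloc


class SyslogBuffer:
    """Hypothetical stand-in for a forwarder's in-memory line buffer."""

    def __init__(self, flush_size=1000):
        self.flush_size = flush_size
        self.pending = []
        self.flushed = 0

    def accept(self, line):
        # Hold the line until a full batch is ready.
        self.pending.append(line)
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self):
        # Hand the batch downstream (here: just count it), then release it.
        self.flushed += len(self.pending)
        self.pending = []


def retained_bytes_after(buffer, n_lines):
    """Feed n_lines through the buffer; return bytes still allocated afterward."""
    gc.collect()
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    for i in range(n_lines):
        buffer.accept(f"<134>1 2022-02-18T01:41:00Z host app - - - message {i}")
    buffer.flush()
    gc.collect()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline


def test_memory_stays_bounded():
    # Assumed threshold: a healthy buffer should retain far less than 1 MB.
    retained = retained_bytes_after(SyslogBuffer(), 200_000)
    assert retained < 1_000_000
```

A test of this shape fails if a change causes flushed lines to remain referenced, since retained memory then grows with the number of lines processed. The threshold and line count would need tuning for a real service.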