LogDNA incident

Many services briefly halted due to Cloud Provider incident

LogDNA experienced a notice incident on January 20, 2022, affecting Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), and five other components, lasting 1h 40m. The incident has been resolved; the full update timeline is below.

Started: Jan 20, 2022, 07:57 PM UTC
Resolved: Jan 20, 2022, 09:38 PM UTC
Duration: 1h 40m
Detected by Pingoru: Jan 20, 2022, 07:57 PM UTC

Affected components

Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Alerting, Livetail

Update timeline

investigating Jan 20, 2022, 07:57 PM UTC

Our cloud provider, Equinix, is having an incident (see https://status.equinixmetal.com/incidents/gjmh37y6rkjp). For about 5-10 minutes, ingestion was halted and the Web UI was not responsive. Some alerts may not have been triggered. All services are currently working, but there are some delays in processing recently sent logs. We are monitoring Equinix’s incident closely.

monitoring Jan 20, 2022, 09:07 PM UTC

Logs are being ingested again without delays. All services are working normally. We will monitor until our Cloud Provider closes their incident.

resolved Jan 20, 2022, 09:38 PM UTC

This incident has been resolved. All services are operational.

postmortem Jan 21, 2022, 07:47 PM UTC

**Dates:**

Start Time: Thursday, January 20, 2022, at 19:13:00 UTC

End Time: Thursday, January 20, 2022, at 21:24:00 UTC

Duration: 02:11:00

**What happened:**

Ingestion was halted and our Web UI was unresponsive for about 5-10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines.

**Why it happened:**

Our service hosting provider Equinix Metal had an outage that was caused by the failure of one of their main switches (more details at [https://status.equinixmetal.com/incidents/gjmh37y6rkjp](https://status.equinixmetal.com/incidents/gjmh37y6rkjp)). The outage impacted traffic and global network connectivity to the LogDNA service. During the Equinix Metal incident, Ingestion, Alerting, and Live Tail were halted and our Web UI was unresponsive for a period of 5-10 minutes. Multiple ElasticSearch (ES) clusters went into an unhealthy state which caused delays for about one hour in newly submitted logs being immediately available for Searching, Graphing, and Timelines.

**How we fixed it:**

No remedial action was possible by LogDNA. We waited until the incident from Equinix Metal, our service hosting provider, was resolved. The ES clusters were repaired and the backlog of newly submitted logs was processed in about one hour.

**What we are doing to prevent it from happening again:**

For this type of incident, LogDNA cannot take proactive preventive measures.