LogDNA incident

Many services briefly halted due to Cloud Provider incident


LogDNA experienced a notice-level incident on January 20, 2022, affecting Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), and other components (listed below), lasting 1h 40m. The incident has been resolved; the full update timeline is below.

Started
Jan 20, 2022, 07:57 PM UTC
Resolved
Jan 20, 2022, 09:38 PM UTC
Duration
1h 40m
Detected by Pingoru
Jan 20, 2022, 07:57 PM UTC

Affected components

Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Alerting, Livetail

Update timeline

  1. investigating Jan 20, 2022, 07:57 PM UTC

    Our cloud provider, Equinix, is having an incident (see https://status.equinixmetal.com/incidents/gjmh37y6rkjp). For about 5-10 minutes, ingestion was halted and the WebUI was not responsive. Some alerts may not have been triggered. All services are currently working, but there are some delays in processing recently sent logs. We are monitoring Equinix’s incident closely.

  2. monitoring Jan 20, 2022, 09:07 PM UTC

    Logs are being ingested again without delays. All services are working normally. We will monitor until our Cloud Provider closes their incident.

  3. resolved Jan 20, 2022, 09:38 PM UTC

    This incident has been resolved. All services are operational.

  4. postmortem Jan 21, 2022, 07:47 PM UTC

    **Dates:**
    Start Time: Thursday, January 20, 2022, at 19:13:00 UTC
    End Time: Thursday, January 20, 2022, at 21:24:00 UTC
    Duration: 02:11:00

    **What happened:** Ingestion was halted and our Web UI was unresponsive for about 5-10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines.

    **Why it happened:** Our service hosting provider Equinix Metal had an outage that was caused by the failure of one of their main switches (more details at https://status.equinixmetal.com/incidents/gjmh37y6rkjp). The outage impacted traffic and global network connectivity to the LogDNA service. During the Equinix Metal incident, Ingestion, Alerting, and Live Tail were halted and our Web UI was unresponsive for a period of 5-10 minutes. Multiple ElasticSearch (ES) clusters went into an unhealthy state which caused delays for about one hour in newly submitted logs being immediately available for Searching, Graphing, and Timelines.

    **How we fixed it:** No remedial action was possible by LogDNA. We waited until the incident from Equinix Metal, our service hosting provider, was resolved. The ES clusters were repaired and the backlog of newly submitted logs was processed in about one hour.

    **What we are doing to prevent it from happening again:** For this type of incident, LogDNA cannot take proactive preventive measures.