LogDNA incident

Service Unavailable

Minor · Resolved

LogDNA experienced a minor incident on December 17, 2020 affecting Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), and six more components, lasting 2d 4h. The incident has been resolved; the full update timeline is below.

Started
Dec 17, 2020, 11:29 PM UTC
Resolved
Dec 20, 2020, 03:49 AM UTC
Duration
2d 4h
Detected by Pingoru
Dec 17, 2020, 11:29 PM UTC

Affected components

Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Alerting, Livetail, Archiving

Update timeline

  1. investigating Dec 17, 2020, 11:29 PM UTC

    We are currently investigating an issue that is rendering our service unavailable.

  2. identified Dec 17, 2020, 11:40 PM UTC

    All services are unavailable due to an incident with our hosting provider. More information can be found here: https://status.equinixmetal.com/

  3. identified Dec 18, 2020, 01:20 AM UTC

    Our provider has almost completely recovered from their incident. We are preparing to restart our own services.

  4. identified Dec 18, 2020, 02:10 AM UTC

    Our provider has now fully recovered from their incident. We have begun bringing our services back online. Please note logs may be unavailable in the web app until we have fully recovered.

  5. identified Dec 18, 2020, 03:33 AM UTC

    We continue to make progress on restoring our service to full functionality. Please note logs may be unavailable in the web app until we have fully recovered.

  6. identified Dec 18, 2020, 03:46 AM UTC

    We continue to make progress on restoring our service to full functionality. Please note logs may be unavailable in the web app until we have fully recovered.

  7. identified Dec 18, 2020, 04:47 AM UTC

    We continue to make progress on restoring our service to full functionality. Please note logs may be unavailable in the web app until we have fully recovered.

  8. identified Dec 18, 2020, 07:54 AM UTC

    New logs are being ingested again, although there is a large backlog to process. Searching, timelines, and alerting based on newly sent logs will be delayed. Live tail is working normally. Logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) are still unavailable in our UI.

  9. identified Dec 18, 2020, 09:20 AM UTC

    Ingestion of new logs is working normally. Logs sent to our service since about 3:00 UTC have mostly been ingested and are mostly available for searching and timelines. Logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) are still unavailable in our UI.

  10. identified Dec 18, 2020, 12:30 PM UTC

    Ingestion of new logs is working normally. Logs sent to our service since about 3:00 UTC have now been ingested and are available for searching and timelines. Logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) are still unavailable in our UI.

  11. monitoring Dec 18, 2020, 03:36 PM UTC

    Service has now been restored. We are monitoring the environment closely at this time. Logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) are still unavailable in our UI.

  12. monitoring Dec 18, 2020, 08:48 PM UTC

    All services are operational. We continue to work on making logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) available in our UI.

  13. resolved Dec 20, 2020, 03:49 AM UTC

    All services are operational. Most logs sent on December 17th for the six hours between 6 pm ET and midnight ET are not available in the UI. Although this incident is now closed, we will continue to work to make archives of these logs available to customers who chose to enable the archiving feature.

  14. postmortem Jan 14, 2021, 10:30 PM UTC

    **Dates:** The incident was opened on December 17, 2020 - 23:29 UTC. Our service was fully operational by December 18, 2020 - 12:30 UTC. The incident was officially closed on December 20, 2020 - 03:49 UTC.

    **What happened:** All services were unavailable for about eight hours. For an additional four hours, services were available but there were significant delays in searching, graphing, and timelines for newly submitted logs. Additionally, all logs submitted during the first six hours of the incident were never processed by our service and were unavailable in the UI, even after our service was fully operational.

    **Why it happened:** Our hosting provider had a major power failure that lasted almost five hours. The hardware that our service runs on was unavailable and none of our services could operate. More details: [https://status.equinixmetal.com/incidents/pfgmgy1fnjcp](https://status.equinixmetal.com/incidents/pfgmgy1fnjcp)

    **How we fixed it:** Once our provider was back online, we gradually restarted all our services. This took time and manual intervention because all our services had been taken down ungracefully by the outage. Around December 18, 2020 - 07:54 UTC, services became operational and logs began to be ingested again. Since no logs had been ingested for about eight hours, our service had a large backlog to process. As it caught up, users experienced delays in searching, graphing, and timelines for newly submitted logs. The backlog was fully processed around December 18, 2020 - 12:30 UTC and services were once again fully operational. Logs submitted during the first six hours of the incident (around December 17, 2020, 23:00 UTC to December 18, 2020, 5:00 UTC) remained unavailable in the UI. Normally, if our service is temporarily unavailable, logs can be resubmitted and successfully processed. In this case, the sudden loss of power brought down our services ungracefully, abruptly interrupting write operations as we processed logs. This resulted in partial and bad writes, which left our service unable to determine, for the resubmitted logs, where log lines began. In effect, logs resubmitted from that six-hour period were unreadable and could not be processed. The incident was kept open as we made attempts to read and process these logs, but these efforts were ultimately unsuccessful. After the incident was closed, we developed the means to restore archives of these logs to all customers with version 3 of archiving enabled. The restoration of archives is expected to begin on the week of January 18th.

    **What we are doing to prevent it from happening again:** We are developing changes to how we write logs so that in a similar event our service will not lose track of the start of log lines and will be able to read and process them.
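The prevention step described above, not losing track of where log lines begin after a torn write, is commonly addressed with self-synchronizing record framing. The sketch below is an illustration of that general technique, not LogDNA's actual implementation: each record is written with a sync marker, a length, and a CRC32 checksum, so a reader hitting a partial or corrupt record can scan forward to the next marker and resume.

```python
import struct
import zlib

# 4-byte sync marker placed before every record. A payload could in
# principle contain these bytes, so resynchronization is heuristic;
# the CRC check rejects false matches.
MAGIC = b"\x00LOG"

def write_record(buf: bytearray, payload: bytes) -> None:
    """Append one framed record: MAGIC | length (u32) | CRC32 (u32) | payload."""
    header = struct.pack(">II", len(payload), zlib.crc32(payload))
    buf += MAGIC + header + payload

def read_records(data: bytes) -> list:
    """Return all intact payloads, skipping torn or corrupt records
    by scanning forward to the next sync marker."""
    out = []
    i = 0
    while True:
        i = data.find(MAGIC, i)
        if i == -1:
            break  # no more markers
        start = i + len(MAGIC)
        if start + 8 > len(data):
            break  # header truncated at end of buffer (torn write)
        length, crc = struct.unpack(">II", data[start:start + 8])
        payload = data[start + 8:start + 8 + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            out.append(payload)
            i = start + 8 + length  # jump past the good record
        else:
            i += 1  # torn/corrupt record: resync at the next marker
    return out

# Simulate a power loss that truncates the buffer mid-record: the
# reader recovers every record written before the torn one.
buf = bytearray()
write_record(buf, b"line one")
write_record(buf, b"line two")
torn = bytes(buf[:-3])  # last record's payload is cut short
```

Without the marker and checksum, a reader restarting after an abrupt power loss has no way to tell where the next valid record begins, which is exactly the failure mode the postmortem describes.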