Logit.io incident

Major outage affecting EU Data Center

Critical Resolved

Logit.io experienced a critical incident on March 10, 2021, affecting the Dashboard, Visualisation Hosts, and Alerting Hosts components and lasting 17h 9m. The incident has been resolved; the full update timeline is below.

Started: Mar 10, 2021, 12:37 AM UTC
Resolved: Mar 10, 2021, 05:46 PM UTC
Duration: 17h 9m
Detected by Pingoru: Mar 10, 2021, 12:37 AM UTC
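
The duration figure follows directly from the start and resolution timestamps. A minimal sketch of that arithmetic, assuming the timestamp format shown on this page; the helper names `parse_utc` and `format_duration` are illustrative and not part of any Logit.io or Pingoru tooling:

```python
from datetime import datetime, timezone

# Timestamp layout as it appears on this page, e.g. "Mar 10, 2021, 12:37 AM UTC".
FMT = "%b %d, %Y, %I:%M %p"


def parse_utc(stamp: str) -> datetime:
    """Parse a page timestamp and attach the UTC timezone (hypothetical helper)."""
    naive = datetime.strptime(stamp.replace(" UTC", ""), FMT)
    return naive.replace(tzinfo=timezone.utc)


def format_duration(start: datetime, end: datetime) -> str:
    """Render the gap between two instants as hours and minutes."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


started = parse_utc("Mar 10, 2021, 12:37 AM UTC")
resolved = parse_utc("Mar 10, 2021, 05:46 PM UTC")
print(format_duration(started, resolved))  # 17h 9m
```

The same helper could be applied to consecutive entries in the update timeline to see how long each gap between vendor updates was.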

Affected components

Dashboard, Visualisation Hosts, Alerting Hosts

Update timeline

  1. investigating Mar 10, 2021, 12:22 AM UTC

    We are currently investigating this issue.

  2. identified Mar 10, 2021, 12:26 AM UTC

    We are working with our hosting provider to restore access to services. We will update in 30 minutes.

  3. identified Mar 10, 2021, 12:37 AM UTC

    We are continuing to work on a fix for this issue. We will update in 30 minutes.

  4. identified Mar 10, 2021, 01:03 AM UTC

    We are working with our hosting provider to restore access to services. We will update in 30 minutes.

  5. identified Mar 10, 2021, 01:28 AM UTC

    We are working with our hosting provider to restore access to services. We will update in 60 minutes.

  6. identified Mar 10, 2021, 02:49 AM UTC

    We are working with our hosting provider to restore access to services. We will update in 60 minutes.

  7. identified Mar 10, 2021, 03:31 AM UTC

    There has been a major fire at one of our data centers affecting some core services. Note: this does not impact Logstash and Elasticsearch log ingestion, which remains unaffected. We have invoked our DR/BCP plan to migrate and restore the affected services to a different data center. We will update in 2 hours.

  8. identified Mar 10, 2021, 06:13 AM UTC

    Our engineers have restored the platform dashboard (https://dashboard.logit.io) and other core services. We will provide another update in 2 hours.

  9. identified Mar 10, 2021, 08:13 AM UTC

    Our engineers are continuing to restore all services and are progressing well with implementing our DR plan. Note: this does not impact Logstash and Elasticsearch log ingestion, which remains unaffected. We will update in 1 hour.

  10. identified Mar 10, 2021, 09:25 AM UTC

    Our engineers are in the process of recreating all Kibana instances and are progressing well with implementing our DR plan. Note: this does not impact Logstash and Elasticsearch log ingestion, which remains unaffected. We will update in 1 hour.

  11. identified Mar 10, 2021, 10:35 AM UTC

    The majority of affected Kibana instances are now back online. Our engineers are continuing to work to restore other core services. We will update in 1 hour.

  12. identified Mar 10, 2021, 11:31 AM UTC

    All Kibana instances are now back online. If you are still having issues with Kibana, please reach out to the team. Our engineers are continuing to work to restore other core services. We will update in 1 hour.

  13. identified Mar 10, 2021, 01:00 PM UTC

    All Kibana instances are now back online. If you are still having issues with Kibana, please reach out to the support team. Our engineers are continuing to work to restore other core services, including the shared API. We will update in 1 hour.

  14. identified Mar 10, 2021, 02:30 PM UTC

    All Kibana instances are now back online. If you are still having issues with Kibana, please reach out to the support team. Our engineers are bringing the API and other core services back online now. We will update in 1 hour.

  15. identified Mar 10, 2021, 03:50 PM UTC

    The ingestion API is now back online. If you are still having issues with the API, please reach out to the support team. Our engineers are working to bring the alerting infrastructure back online. We will update in 1 hour.

  16. monitoring Mar 10, 2021, 04:56 PM UTC

    We have recovered from backups all of the major core services that had been lost in the EU data center fire. We will continue to monitor the platform for the coming hours to ensure stability, but we believe we have fully recovered all services. If you have any questions or need support, please reach out to us.

  17. resolved Mar 10, 2021, 05:46 PM UTC

    This incident has been resolved.

  18. postmortem Mar 24, 2021, 02:34 PM UTC

    [Click here to view postmortem](https://logit.io/blog/post/logit-io-platform-outage-incident-report)