Palisis Main Database Failure

Critical · Resolved

Palisis experienced a critical incident on September 16, 2019, affecting Palisis Backend and Palisis Webshops and lasting 3h 32m. The incident has been resolved; the full update timeline is below.

Started
Sep 16, 2019, 10:03 AM UTC
Resolved
Sep 16, 2019, 01:35 PM UTC
Duration
3h 32m
Detected by Pingoru
Sep 16, 2019, 10:03 AM UTC

Affected components

Palisis Backend, Palisis Webshops

Update timeline

  1. investigating Sep 16, 2019, 10:03 AM UTC

    We have identified an issue with our deployment today. We will roll back the release to 4.35.10.

  2. monitoring Sep 16, 2019, 10:04 AM UTC

    We rolled back the deployment performed today. All systems should be operational again.

  3. investigating Sep 16, 2019, 10:08 AM UTC

    The issue is not yet resolved. We are continuing to investigate the root cause and will keep you updated here.

  4. identified Sep 16, 2019, 11:20 AM UTC

    We are still searching for the root cause of why our main database is not responding. Rest assured that resolving this issue is our highest priority and all hands are on deck. We are truly sorry for the trouble this creates for you and your business! As soon as we know more, we will post an update here.

  5. identified Sep 16, 2019, 12:36 PM UTC

    We are in constant contact with our hosting providers to resolve the issue. Unfortunately, we cannot give you an ETA yet.

  6. monitoring Sep 16, 2019, 01:16 PM UTC

    The system is back online. Due to the heavy load now arriving from all offline sales channels, the system may remain slower for the next few minutes. We are monitoring the system and its stability closely.

  7. monitoring Sep 16, 2019, 01:20 PM UTC

    We are continuing to monitor for any further issues.

  8. resolved Sep 16, 2019, 01:35 PM UTC

    All systems are operational again. We will analyze in detail what happened and, together with our partners and suppliers, put measures in place to prevent this from happening again. A detailed report will follow. Once again, we are truly sorry for the problems this caused your business today. We are proud of our historically excellent uptime, and this incident shows us that we must keep working hard so that Palisis uptime remains something you can take for granted, as you have in the past.

  9. postmortem Oct 02, 2019, 03:51 PM UTC

    On Sep 16, 2019 - 09:10 UTC Palisis faced a worldwide downtime due to a failing system component. The architecture of the Palisis booking engine is designed to be highly redundant and fail-proof: every component of the system is either self-healing, redundant, or equipped with automatic failover to an active/standby system. On this day one of our message queues, which handles tasks triggered asynchronously by a booking, failed and did not heal itself, because the error was unexpected, silent, and originated from a third-party service outside our direct control. The message queue system, a standard piece of software named ActiveMQ that is used by thousands of systems worldwide, stopped accepting messages but did not return an error. It was still accepting incoming connections, but it was not confirming messages sent to it and did not time out the connection. Incoming bookings therefore ran into a timeout, by design, to prevent any data inconsistency or data loss.

    **Investigation**

    The only error messages we could see were failing writes to the database because of locked rows, caused by booking processes waiting for the message queue to accept their messages. According to all of its status parameters the message queue was fully functional, and it was still accepting new connections. Palisis uses ActiveMQ as a managed service hosted by Amazon Web Services, so we have only limited access to the hardware. The service is configured with automatic failover to a standby instance, which was not triggered because the software did not recognize the error. After the Palisis Operations team determined that the problem was the message queue, the involved AWS Business Support team recommended that we scale up the instance size of the message queue servers. Only a complete replacement of the message queue system, including the active and standby instances, solved the problem.

    **Preventing this error from happening again**

    In addition to our regular monitoring of component states, the system now checks every component multiple times a minute from each application server, for both reads and writes. These tests are very fast and only take a few milliseconds. We have added tests like this to our message queues to monitor the correct processing of messages. Over the past 10 years the Palisis system has proven its reliability, with an average availability of over 99.99% and 100% uptime in most months. The move of the TourCMS system and the websites to AWS has also shown that we are capable of building stable systems. We will continue to operate our service in the best way possible, and we aspire to 100% uptime.
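
    The read-and-write check described above can be made concrete with a small round-trip probe. The sketch below, in Java with the ActiveMQ JMS client, publishes a unique token to a dedicated health-check queue and reports the broker healthy only if that exact token is delivered back within a deadline; a broker that accepts connections but silently stops delivering messages, as in this incident, fails the check. This is an illustrative sketch, not Palisis code: the `monitoring.health-check` queue name, the `QueueHealthCheck` class, and the timeout handling are assumptions.

    ```java
    import java.util.UUID;

    import javax.jms.Connection;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class QueueHealthCheck {

        /** Round-trips a token through the broker; returns false on timeout or error. */
        public static boolean check(String brokerUrl, int timeoutMs) {
            ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(brokerUrl);
            // Fail fast instead of blocking forever if the broker stops confirming sends.
            factory.setSendTimeout(timeoutMs);
            Connection connection = null;
            try {
                connection = factory.createConnection();
                connection.start();
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                Queue queue = session.createQueue("monitoring.health-check"); // hypothetical queue name
                MessageConsumer consumer = session.createConsumer(queue);
                MessageProducer producer = session.createProducer(queue);

                String token = UUID.randomUUID().toString();
                producer.send(session.createTextMessage(token));

                // A healthy broker delivers the token back before the deadline; a broker
                // that accepts connections but silently drops messages never does.
                long deadline = System.currentTimeMillis() + timeoutMs;
                long remaining;
                while ((remaining = deadline - System.currentTimeMillis()) > 0) {
                    Message reply = consumer.receive(remaining);
                    if (reply == null) {
                        break; // timed out waiting for our token
                    }
                    if (reply instanceof TextMessage
                            && token.equals(((TextMessage) reply).getText())) {
                        return true;
                    }
                    // Otherwise it was a stale token from an earlier timed-out check; keep draining.
                }
                return false;
            } catch (JMSException e) {
                return false; // any broker error counts as unhealthy
            } finally {
                if (connection != null) {
                    try { connection.close(); } catch (JMSException ignored) { }
                }
            }
        }
    }
    ```

    Run from every application server on a short interval, a probe along these lines would surface the silent failure mode within seconds, which is the goal of the additional message-queue tests described above.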