TrekkSoft incident

TrekkSoft global issue affecting several production functionalities

TrekkSoft experienced a major incident on August 27, 2021, affecting TrekkSoft Backoffice, TrekkSoft API, and 3 more components, lasting 3h 19m. The incident has been resolved; the full update timeline is below.

Started: Aug 27, 2021, 11:16 AM UTC
Resolved: Aug 27, 2021, 02:36 PM UTC
Duration: 3h 19m
Detected by Pingoru: Aug 27, 2021, 11:16 AM UTC

Affected components

TrekkSoft Backoffice
TrekkSoft API
TrekkSoft Mobile App (mPOS)
POS Desk
TrekkSoft Website Builder

Update timeline

investigating Aug 27, 2021, 11:16 AM UTC

We are currently experiencing global issues in several TrekkSoft production functionalities. Our developers are already investigating the root cause. We will keep you updated and apologize for the inconvenience caused.

monitoring Aug 27, 2021, 11:50 AM UTC

Our developers have identified the root cause of the issue and have already found a fix for it. The issue was caused by an unusual spike in network traffic that degraded the performance of our servers. TrekkSoft functionalities are performing normally again. Our developers continue to monitor the situation closely to ensure there are no further performance issues.

monitoring Aug 27, 2021, 12:01 PM UTC

We are still facing some issues with taking bookings and payments. Our developers are working on a fix for this issue as well.

monitoring Aug 27, 2021, 12:56 PM UTC

The issues with bookings and payments have now been addressed as well. TrekkSoft functionalities are performing as expected again. Our developers continue to monitor the situation closely to ensure there are no further issues.

resolved Aug 27, 2021, 02:36 PM UTC

The incident has been resolved and all functionalities are performing as expected. We are continuing to investigate what caused this unusual spike in network traffic and will provide a postmortem once the investigation is finished.

postmortem Aug 30, 2021, 12:15 PM UTC

**What happened?** One of the external services our system uses was experiencing issues. This overloaded our web servers, which eventually crashed as a result.

**What did we do?** Once our developers identified the issue, they fixed it by restarting the failed services and servers, which took some time to return to their normal operating state.

**Impact** During the estimated 2-hour time frame of the incident (12:30 to 14:30 CEST), no bookings could be processed, and our customers had issues accessing our web and mobile applications.

**Learnings** To prevent similar incidents in the future, we added additional monitoring to the aforementioned external service as well as to the servers on which the service is used. With this in place, we will have a better overview if a similar issue happens again, which will allow us to act faster and minimize the impact on our customers.

We apologize for any inconvenience this might have caused you.