TrekkSoft incident

Global issues in several TrekkSoft production functionalities

TrekkSoft experienced a major incident on October 22, 2023, affecting TrekkSoft Backoffice, TrekkSoft API, and three other components. The incident lasted 23h 22m and has been resolved; the full update timeline is below.

Started: Oct 22, 2023, 08:41 AM UTC
Resolved: Oct 23, 2023, 08:04 AM UTC
Duration: 23h 22m
Detected by Pingoru: Oct 22, 2023, 08:41 AM UTC

Affected components

TrekkSoft Backoffice
TrekkSoft API
TrekkSoft Mobile App (mPOS)
POS Desk
TrekkSoft Website Builder

Update timeline

investigating Oct 22, 2023, 08:41 AM UTC

We are currently experiencing global issues affecting several TrekkSoft production functions (including Backoffice and POS Desk). Our developers are already investigating the root cause. We will keep you updated, and we apologize for the inconvenience.

monitoring Oct 22, 2023, 09:31 AM UTC

TrekkSoft functionality is operational again for the vast majority of our users. We are continuing to work on the issue and monitor the situation.

resolved Oct 23, 2023, 08:04 AM UTC

The incident has been resolved, and all TrekkSoft functionality is operating normally again. We will provide a postmortem of the incident in the coming days. Once again, we apologize for any inconvenience this may have caused.

postmortem Oct 27, 2023, 09:41 AM UTC

**Incident Date**: October 22, 2023

**Incident Duration**: Approximately 1 hour

**Affected Services**: All services

**Incident Description**: At approximately 9:30 AM CET on October 22, 2023, our database instance ran out of disk space. The disk attached to the DB instance reached full capacity, which disrupted database operations and impacted every service that depends on this database.

**Impact**: Write operations on the database became impossible, halting functionality across all of our services. The disruption lasted around one hour, until our developers were able to mitigate the issue.

**Resolution**: We resolved the incident by leveraging auto-scaling capabilities and expanding the disk size, which cleared the disk-full condition and restored normal operations.

**Preventive Measures and Recommendations:**

1. **Alert Monitoring**: Monitor disk usage and alert before the disk reaches full capacity.
2. **Auto-scaling Configuration**: Enable the auto-scaling function, which automatically adjusts resource capacity to accommodate increased demand.

By implementing these measures, we can proactively address and mitigate similar incidents in the future, ensuring the continued reliability of our services.
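As an illustration of the alert-monitoring measure, the following is a minimal sketch of a disk-usage threshold check in Python. It is not TrekkSoft's actual tooling: the postmortem does not name the database, the hosting platform, or any threshold values, so the 80% and 90% limits and the names `classify_usage` and `check_path` are assumptions made for this example. It uses only `shutil.disk_usage` from the Python standard library.

```python
import shutil

# Hypothetical thresholds; the postmortem does not state which values are used.
WARN_PERCENT = 80.0
CRIT_PERCENT = 90.0


def classify_usage(used_bytes: int, total_bytes: int) -> str:
    """Return 'ok', 'warning', or 'critical' for the given disk usage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    percent = 100.0 * used_bytes / total_bytes
    if percent >= CRIT_PERCENT:
        return "critical"
    if percent >= WARN_PERCENT:
        return "warning"
    return "ok"


def check_path(path: str = "/") -> str:
    """Classify the current usage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    # In a real setup this result would feed an alerting channel, not stdout.
    print(check_path("/"))
```

A check like this would typically run on a schedule against the volume that holds the database files, so that a "warning" result leaves time to expand the disk before writes start failing.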