TrekkSoft incident

Registered downtime 3rd of October 2024

TrekkSoft experienced a minor incident on October 3, 2024, affecting TrekkSoft Backoffice, TrekkSoft API, and POS Desk, lasting 1h 39m. The incident has been resolved; the full update timeline is below.

Started: Oct 03, 2024, 11:28 AM UTC
Resolved: Oct 03, 2024, 01:07 PM UTC
Duration: 1h 39m
Detected by Pingoru: Oct 03, 2024, 11:28 AM UTC

Affected components

TrekkSoft Backoffice
TrekkSoft API
POS Desk

Update timeline

investigating Oct 03, 2024, 11:28 AM UTC

TrekkSoft experienced downtime during the last hour, with the outage lasting approximately 10 minutes. Since then, all systems have come back online. We are actively investigating whether this was caused by a potential cyber attack, while also reviewing our infrastructure for other possible causes. We will provide further updates as soon as we have more information. We apologize for this inconvenience.
investigating Oct 03, 2024, 01:05 PM UTC

We are continuing to investigate this issue.
resolved Oct 03, 2024, 01:07 PM UTC

The incident has been resolved and all TrekkSoft functionalities are operating as expected. We have determined that the issue originated from one of our infrastructure services unexpectedly stopping and restarting, which triggered a cascading effect and led to a brief system outage. The responsibility for maintaining this service lies with our cloud services provider, AWS, and we have reached out to them for further clarification. We will explore measures to mitigate this type of issue on our end and will provide a postmortem of the incident in the coming days. We apologize once again for any inconvenience this may have caused.
postmortem Oct 04, 2024, 03:01 PM UTC

**Incident Date**: October 3rd, 2024

**Incident Duration**: Approximately 20 minutes

**Affected Services**: TrekkSoft API, TrekkSoft Application, POS Desk

**Incident Description**: At approximately 12:15 PM CET on October 3rd, 2024, the system went down.

**Impact**: The Redis node used by the API for session storage was rebooted and came back approximately 20 minutes later. The node went out of service outside its maintenance windows. We opened a support ticket with AWS to understand why it was rebooted.

**Resolution**: The incident was resolved once the rebooted Redis node (used by the API for session storage) came back online.

**Learnings**: The API uses redis-core-production for session storage. This is a single-node instance, so a reboot of that node takes session storage offline.

**Preventive Measures**

* Review the AWS fault tolerance reference: [Mitigating Failures - Amazon ElastiCache (Redis OSS)](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/FaultTolerance.html)
* Ensure proper configuration is used for the redis-core-production instance
* Review and improve our API session handling logic, or even consider other types of persistence for session storage
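
The last preventive measure, making session handling tolerate the loss of the single session-storage node, could look roughly like the sketch below. This is an illustration only, not TrekkSoft's implementation: `FallbackSessionStore`, `InMemoryBackend`, and `FlakyBackend` are hypothetical names, the secondary store stands in for whatever other persistence might be chosen, and the built-in `ConnectionError` is assumed as the failure signal (a real Redis client would likely raise its own exception types, so the `except` clauses would need adjusting).

```python
from typing import Optional


class InMemoryBackend:
    """Hypothetical secondary store; stands in for any other persistence."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class FlakyBackend(InMemoryBackend):
    """Hypothetical stand-in for a single cache node that can go offline."""

    def __init__(self) -> None:
        super().__init__()
        self.online = True

    def _check(self) -> None:
        if not self.online:
            raise ConnectionError("primary session store unreachable")

    def get(self, key: str) -> Optional[str]:
        self._check()
        return super().get(key)

    def set(self, key: str, value: str) -> None:
        self._check()
        super().set(key, value)


class FallbackSessionStore:
    """Reads from the primary store, falling back to the secondary on failure."""

    def __init__(self, primary, secondary) -> None:
        self.primary = primary
        self.secondary = secondary

    def set(self, key: str, value: str) -> None:
        # Always write to the secondary so sessions survive a primary outage.
        self.secondary.set(key, value)
        try:
            self.primary.set(key, value)
        except ConnectionError:
            pass  # Primary is down; the secondary already holds the session.

    def get(self, key: str) -> Optional[str]:
        try:
            value = self.primary.get(key)
            if value is not None:
                return value
        except ConnectionError:
            pass  # Primary is down; fall through to the secondary.
        return self.secondary.get(key)
```

With this shape, a reboot of the primary node degrades session reads to the secondary store instead of failing requests outright; the trade-off is a second write on every session update.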