Knak incident

Application down - 500 error

Critical Resolved

Knak experienced a critical incident on April 3, 2025 affecting Knak App, lasting 1h 21m. The incident has been resolved; the full update timeline is below.

Started
Apr 03, 2025, 12:09 PM UTC
Resolved
Apr 03, 2025, 01:30 PM UTC
Duration
1h 21m
Detected by Pingoru
Apr 03, 2025, 12:09 PM UTC

Affected components

Knak App

Update timeline

  1. investigating Apr 03, 2025, 12:09 PM UTC

    We are currently investigating this issue.

  2. monitoring Apr 03, 2025, 12:29 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. monitoring Apr 03, 2025, 12:29 PM UTC

    We are continuing to monitor for any further issues.

  4. monitoring Apr 03, 2025, 12:57 PM UTC

    We are continuing to monitor for any further issues.

  5. resolved Apr 03, 2025, 01:30 PM UTC

    This incident has been resolved.

  6. postmortem Apr 04, 2025, 07:36 PM UTC

    On April 3rd at 7:26 AM EDT, our application experienced an incident that left it unresponsive and unable to serve requests. Remediation began as soon as we became aware of the problem, and a temporary fix was implemented by 8:25 AM EDT, at which point we believed the underlying cause had been resolved.

    On April 4th at 8:48 AM EDT, the application again became unresponsive. Another temporary fix was deployed by 9:03 AM EDT on April 4th.

    Subsequent investigation revealed the root cause to be the unintended accumulation of temporary server operating logs, which exhausted disk space on the application server. A permanent fix has since been implemented to prevent further log accumulation, and no additional space consumption has been observed. No action is required from our customers regarding this incident.

    **Root Cause:** During runtime, our logging system was failing to send logs to our external logging tools. This prompted the server to fall back to logging on the server itself. Because the logging directory was on ephemeral storage, a new deployment would temporarily fix the issue. It was not until April 3rd at 7:26 AM EDT that our temporary storage filled completely, leaving the server without any disk swap space and unable to respond to requests.

    **Actions:**

    1. We have fixed the issue with our logging so that it no longer falls back to logging on the server.
    2. We have added strict alerting on the temporary storage on our server so that we are alerted before it runs out of space.
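The kind of storage alerting described in the postmortem can be illustrated with a small disk-usage check. This is a minimal sketch under stated assumptions, not Knak's actual monitoring: the function names, the 80% and 90% thresholds, and the monitored path are all hypothetical, and a real setup would more likely use an existing monitoring agent than a hand-written script.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_usage(percent_used: float,
                   warn_at: float = 80.0,
                   critical_at: float = 90.0) -> str:
    """Map a usage percentage to an alert level.

    The default thresholds are illustrative assumptions, not values
    taken from the incident report.
    """
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


def check_path(path: str) -> str:
    """Measure usage for `path` and return its alert level."""
    return classify_usage(disk_usage_percent(path))


if __name__ == "__main__":
    # "/tmp" is a hypothetical stand-in for the server's temporary log storage.
    monitored_path = "/tmp"
    level = check_path(monitored_path)
    print(f"{monitored_path}: {disk_usage_percent(monitored_path):.1f}% used ({level})")
```

Run on a schedule, a check like this would raise a warning while space is still available, rather than surfacing the problem only after the disk is full and the application has stopped responding.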