Templafy incident

Service degradation - Users encounter issues when accessing West Europe (Production 1)

Major · Resolved

Templafy experienced a major incident on February 13, 2025 affecting Library & Dynamics, lasting 1h 31m. The incident has been resolved; the full update timeline is below.

Started
Feb 13, 2025, 12:10 PM UTC
Resolved
Feb 13, 2025, 01:42 PM UTC
Duration
1h 31m
Detected by Pingoru
Feb 13, 2025, 12:10 PM UTC

Affected components

Library & Dynamics

Update timeline

  1. identified Feb 13, 2025, 01:00 PM UTC

    We have identified an issue that affects a subset of customers and are working towards a resolution. Further updates will be posted here soon.

  2. monitoring Feb 13, 2025, 01:01 PM UTC

    The incident has been successfully mitigated, and our team is actively monitoring the situation to ensure ongoing stability and performance. We are observing the systems to prevent any further disruptions.

  3. resolved Feb 13, 2025, 01:42 PM UTC

    The incident has been resolved, and further information will be provided in a postmortem shortly. We apologize for the impact to affected customers.

  4. postmortem Feb 20, 2025, 02:50 PM UTC

    **Incident Initiation** On February 13, 2025, at 1:27 AM CET, an issue was introduced that led to excessive CPU utilization due to heavy database usage in West Europe (Production 1). The issue remained undetected until February 13, 2025, at 1:18 PM CET, when monitoring systems flagged abnormal resource consumption. The impact was significant, affecting multiple tenants and a large number of users. The degraded performance resulted in a high rate of exceptions and errors in system logs and hindered application functionality.

    **Investigation** The engineering team began investigating on February 13, 2025, at 1:36 PM CET. They hypothesized that the issue was related to an inefficient database query or an increase in workload. Further analysis confirmed that heavy database usage was the root cause of the CPU maxing out, which led to the performance degradation.

    **Mitigation and Resolution** To mitigate the incident, the engineering team scaled up the database resources at 1:40 PM CET to stabilize the application. Continuous monitoring was put in place to track system performance and confirm stability. By 2:29 PM CET, the application had stabilized and error rates had dropped significantly. The team planned to revert the database resources to normal levels by the following morning.

    **Impact and Scope** The incident impacted multiple tenants across various clusters, leading to performance degradation for affected users. The issue was widespread, reducing application responsiveness and generating increased system errors.

    **Post-Incident Actions** In response to this incident, the engineering team will carry out a detailed review of database query efficiency and workload distribution. Additional procedural improvements will be made to monitor resource consumption more closely and to introduce alerting mechanisms for early anomaly detection.

We sincerely apologize for the disruption caused by this incident. Our engineering team is committed to ensuring service reliability and stability. We appreciate our customers' patience and understanding as we continue to enhance our monitoring and mitigation processes.
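
As an illustration of the "alerting mechanisms for early anomaly detection" mentioned under Post-Incident Actions, the following is a minimal sketch of a sustained-threshold CPU alert. It is not Templafy's implementation: the class name `SustainedCpuAlert`, the 90% threshold, and the window size are all hypothetical choices made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SustainedCpuAlert:
    """Fire when CPU stays at or above a threshold for `window` consecutive samples.

    All names and default values here are hypothetical, chosen for illustration.
    """
    threshold_pct: float = 90.0  # assumed alert threshold, in percent
    window: int = 5              # assumed number of consecutive samples required
    _recent: deque = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Keep only the most recent `window` samples.
        self._recent = deque(maxlen=self.window)

    def observe(self, cpu_pct: float) -> bool:
        """Record one CPU sample; return True if the alert condition holds."""
        self._recent.append(cpu_pct)
        window_full = len(self._recent) == self.window
        return window_full and all(s >= self.threshold_pct for s in self._recent)


alert = SustainedCpuAlert(threshold_pct=90.0, window=3)
samples = [45.0, 92.0, 97.0, 99.0, 60.0]
fired = [alert.observe(s) for s in samples]
print(fired)  # [False, False, False, True, False]
```

Requiring several consecutive samples above the threshold, instead of alerting on a single spike, is one common way to reduce false alarms while still flagging sustained load well before hours have passed.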