Templafy incident

Service degradation: Slow Access on West Europe (Production 1)

Major · Resolved

Templafy experienced a major incident on May 28, 2024, affecting Library & Dynamics and lasting 1h 52m. The incident has been resolved; the full update timeline is below.

Started
May 28, 2024, 08:45 AM UTC
Resolved
May 28, 2024, 10:37 AM UTC
Duration
1h 52m
Detected by Pingoru
May 28, 2024, 08:45 AM UTC

Affected components

Library & Dynamics

Update timeline

  1. identified May 28, 2024, 08:52 AM UTC

    We have identified an issue that affects a subset of customers and are working towards a resolution. Further updates will be posted here soon.

  2. monitoring May 28, 2024, 09:10 AM UTC

    The incident has been successfully mitigated, and our team is actively monitoring the situation to ensure ongoing stability and performance. We are observing the systems to prevent any further disruptions.

  3. resolved May 28, 2024, 10:37 AM UTC

    The incident has been resolved, and further information will be provided in a postmortem shortly. We apologize for the impact to affected customers.

  4. postmortem May 29, 2024, 12:34 PM UTC

    On May 28, 2024, at 10:45 CEST (08:45 UTC), we detected an incident affecting all users of the Library & Dynamics system in the West Europe (Production 1) environment. Performance was degraded, and users experienced slow responses or timeouts.

    The engineering team quickly determined that the cause was heavy load on the SQL server from a reindexing operation, which was part of a migration the team was rolling out.

    At 11:00 CEST, as an immediate mitigation, the team began increasing the SQL server's capacity. By 11:05 CEST, the additional resources had been allocated, application performance had returned to normal, and users were no longer impacted. By 12:37 CEST, the team had applied the migration and confirmed that it was working as expected, and the incident was resolved.

    We are reviewing and improving our internal migration procedures to prevent similar issues in the future.
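The local times in the postmortem can be cross-checked against the UTC timestamps in the update timeline. The following is a minimal sketch, assuming the postmortem's clock times are in Central European Summer Time (UTC+2), which is the offset that makes them agree with the UTC timestamps above. The helper names `to_utc` and `fmt` are illustrative and not part of any Templafy tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the postmortem's local clock times are UTC+2 (CEST).
LOCAL = timezone(timedelta(hours=2))


def to_utc(hour, minute):
    """Convert a local clock time on May 28, 2024 to UTC."""
    local = datetime(2024, 5, 28, hour, minute, tzinfo=LOCAL)
    return local.astimezone(timezone.utc)


def fmt(delta):
    """Format a timedelta as 'Xh Ym'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60}m"


detected = to_utc(10, 45)       # incident detected
scale_up_done = to_utc(11, 5)   # extra SQL server resources allocated
resolved = to_utc(12, 37)       # migration applied and confirmed

print("Detected:", detected.strftime("%H:%M UTC"))
print("Resolved:", resolved.strftime("%H:%M UTC"))
print("Time to mitigation:", fmt(scale_up_done - detected))
print("Total duration:", fmt(resolved - detected))
```

Under that assumption, the detection and resolution times convert to 08:45 UTC and 10:37 UTC, matching the Started and Resolved fields, and the total comes to the reported 1h 52m.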