Convercent incident

Issue Manager and Disclosure Manager Loading Issues

Major Resolved

Convercent experienced a major incident on October 15, 2024 affecting EU Production, lasting 1h 17m. The incident has been resolved; the full update timeline is below.

Started
Oct 15, 2024, 01:40 PM UTC
Resolved
Oct 15, 2024, 02:58 PM UTC
Duration
1h 17m
Detected by Pingoru
Oct 15, 2024, 01:40 PM UTC

Affected components

EU Production

Update timeline

  1. investigating Oct 15, 2024, 01:40 PM UTC

    Users may be unable to load the Issue Manager or Disclosure Manager within the Convercent platform. The impact is specific to customers hosted on the Prod-EU environment. We’re investigating the issue and will provide an update within the next hour.

  2. resolved Oct 15, 2024, 02:58 PM UTC

    Corrective actions have been deployed and, after a period of monitoring, we can confirm the resolution of this incident.

  3. postmortem Nov 25, 2024, 08:13 AM UTC

    # Event Description

    On Tuesday, October 15th, 2024, from 10:45 to 14:54 UTC (4 hours and 9 minutes), users of the Helpline Case Management platform in the EU Production environment experienced issues with the Issue Manager and Disclosure Manager screens. While users were able to log in, these screens failed to load and remained in an indefinite loading state.

    # Customer Impact Summary

    During the incident, affected users were unable to access or interact with the Issue Manager and Disclosure Manager screens, resulting in disruptions to their workflows and significant delays.

    # Findings and Root Cause

    The incident was caused by the simultaneous execution of multiple OData reports on the secondary database, which triggered queries with parameter sniffing issues. Concurrently, a database backup process was running, further straining the system. The OData reports were scheduled to run at the same time and lacked throttling mechanisms to limit concurrent requests. The parameter sniffing issues arose because the SQL Server version in use does not support multiple cache plans, a feature available in newer SQL Server versions.

    ## Mitigation

    The issue was resolved by clearing blocking queries, which alleviated the Disk I/O channel saturation on the secondary database. Following this action, the Issue Manager, Disclosure Manager, and OData queries resumed normal functionality.

    ### How could this incident have been avoided?

    By implementing workarounds to address parameter sniffing, such as using local variables in queries to ensure consistent execution plans.

    ### How could we have detected this issue sooner?

    By analyzing the execution patterns of scheduled OData reports and leveraging our internal monitoring tools to proactively identify performance issues.

    ### Is there sufficient monitoring to detect this incident proactively?

    Not yet. Implementing monitoring alerts would enable proactive notifications of similar incidents in the future.
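As an illustration of what "analyzing the execution patterns of scheduled OData reports" could look like, the sketch below computes the peak number of reports expected to run at the same moment and flags a schedule that exceeds a chosen limit. It is a minimal sketch, not the vendor's monitoring: the schedule format, the report names, the timings, and the functions `peak_concurrency` and `should_alert` are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def peak_concurrency(schedule):
    """Return the highest number of reports expected to run at once.

    `schedule` is a list of (name, start, expected_duration) tuples.
    This shape is hypothetical; it does not come from the postmortem.
    """
    events = []
    for _name, start, duration in schedule:
        events.append((start, 1))              # a report starts
        events.append((start + duration, -1))  # a report finishes
    # Sort by time; at identical times, finishes (-1) sort before starts (+1),
    # so a report ending exactly when another begins is not counted as overlap.
    events.sort(key=lambda e: (e[0], e[1]))

    running = 0
    peak = 0
    for _time, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def should_alert(schedule, max_concurrent):
    """True when the schedule would exceed the allowed concurrency."""
    return peak_concurrency(schedule) > max_concurrent


# Hypothetical schedule: three reports clustered together, one later.
day = datetime(2024, 1, 1)
schedule = [
    ("report_a", day.replace(hour=2, minute=0), timedelta(minutes=30)),
    ("report_b", day.replace(hour=2, minute=0), timedelta(minutes=20)),
    ("report_c", day.replace(hour=2, minute=5), timedelta(minutes=10)),
    ("report_d", day.replace(hour=4, minute=0), timedelta(minutes=15)),
]

if __name__ == "__main__":
    print(peak_concurrency(schedule))  # 3
    print(should_alert(schedule, 2))   # True
```

A check like this only inspects the planned schedule. Detecting saturation while it is happening would still require alerts on the database itself, which the postmortem lists as not yet in place.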
    ### Is there a contingency or plan to control future incidents of this kind?

    Yes, parameter sniffing workarounds will be included in future releases, and additional monitoring measures will be implemented to mitigate similar risks.

    # Corrective Actions

    Short-term

    * Clear the blocking queries, which alleviated the Disk I/O channel saturation on the secondary database.

    Long-term

    * Introduce throttling or staggering mechanisms for OData report execution to avoid simultaneous overloads.
    * Incorporate parameter sniffing solutions as a best practice in all future database queries and releases.
    * Implement proactive monitoring for the execution patterns of scheduled OData reports.
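The throttling and staggering ideas in the long-term actions can be sketched generically. The example below is an assumption-laden illustration, not the vendor's implementation: `run_throttled`, `staggered_offsets`, and `fake_report` are hypothetical names, and a short sleep stands in for a real report query. Throttling caps how many reports execute at once; staggering spreads start times so that reports do not begin together.

```python
import threading
import time


def run_throttled(reports, max_concurrent):
    """Run each report callable on its own thread, allowing at most
    `max_concurrent` to execute at once. Returns the peak concurrency
    actually observed, which is useful for verifying the limit."""
    gate = threading.BoundedSemaphore(max_concurrent)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def worker(report):
        with gate:  # blocks while max_concurrent reports are already running
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            try:
                report()
            finally:
                with lock:
                    state["running"] -= 1

    threads = [threading.Thread(target=worker, args=(r,)) for r in reports]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


def staggered_offsets(count, max_concurrent, slot_seconds):
    """Start delays in seconds so that at most `max_concurrent` reports
    share a start slot. This limits overlap only if each report finishes
    within `slot_seconds`."""
    return [(i // max_concurrent) * slot_seconds for i in range(count)]


def fake_report():
    # Hypothetical stand-in for an OData report query.
    time.sleep(0.05)


if __name__ == "__main__":
    print(run_throttled([fake_report] * 6, max_concurrent=2))
    print(staggered_offsets(5, max_concurrent=2, slot_seconds=300))
```

The two approaches complement each other: staggering reduces contention when run times are predictable, while a concurrency cap still holds when a report runs longer than expected.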