Convercent experienced a critical incident on June 7, 2024 affecting EU Production, lasting 1h 19m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Jun 07, 2024, 09:06 AM UTC
We’ve identified an issue where users are unable to log in to the Production EU environment. Customers may observe “Invalid username or password or unknown account” and/or SSO Server Error in '/' Application errors. We’re investigating the issue and will provide an update as soon as possible.
- investigating Jun 07, 2024, 10:03 AM UTC
We have engaged our third-party partners and continue to diagnose the issue. Further updates will be provided as soon as possible. If you are experiencing issues, please contact our support teams quoting IM-582.
- resolved Jun 07, 2024, 10:26 AM UTC
Corrective actions have been deployed and, following a period of monitoring, we've confirmed the resolution of this incident. If you are still experiencing issues, please contact our support teams quoting IM-582. A root cause analysis has been initiated and an RCA will be provided.
- postmortem Jun 20, 2024, 06:27 AM UTC
# Event Description

On Friday, 7th June 2024, between 08:45 UTC and 10:19 UTC, users in the EU Production environment experienced issues accessing the Ethics & Compliance cloud. During this period, users encountered "Invalid username or password" and "Server error in '/' application" errors. This outage resulted in significant disruption, preventing users from logging into the application.

# Customer Impact Summary

This incident directly impacted the functionality of the Ethics & Compliance cloud, leaving users unable to log in due to a database failure.

# Findings and Root Cause

Engineering teams observed unexpected writes to the TempLogs drive, which caused the drive to reach its storage capacity. Multiple alerts were set up to notify the team about storage issues, and our third-party SQL vendor was informed and actively engaged. However, the vendor lacked the sysadmin access needed to fully investigate and address the underlying process causing the excessive writes. Unable to address the issue directly, the vendor quickly escalated the matter to OneTrust. OneTrust intervened by restarting the server, which terminated the problematic processes and restored normal functionality to the database.

# Mitigation

To mitigate the impact and restore service, engineers restarted the Windows service on the affected server. This action cleared the blocked processes, allowing users to regain access to the Ethics & Compliance cloud.

### How could this incident have been avoided?

This incident could have been prevented by granting sysadmin access to our third-party SQL vendor from the outset. This would have enabled them to directly investigate and resolve the issue, avoiding the need for escalation.

### How could we have detected the issue sooner?

Setting lower alert thresholds would have provided the team with more time to react before the storage reached critical capacity.

### Is there a contingency or plan to control future incidents of this kind?

To prevent similar incidents in the future, we will implement the following contingency measures:

* Regularly review and adjust alert thresholds to ensure they are set at levels that provide adequate lead time for the team to respond.
* Ensure that our third-party vendors have the necessary sysadmin rights to perform comprehensive investigations and immediate remediation of issues.

# Corrective Actions

Short-term

* Restarted the Windows service on the affected server.

Long-term

* Review and adjust alert thresholds.
* Implement measures to ensure third-party vendors have the necessary sysadmin rights.
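
The alert-threshold review listed above can be illustrated with a minimal Python sketch. It is not the monitoring used in this incident. The threshold values, function names, and the constant-growth estimate are all hypothetical, chosen only to show how an earlier warning tier translates into more lead time before a drive fills.

```python
import shutil
from typing import Optional

# Hypothetical thresholds (fraction of capacity used); not taken from the incident report.
WARNING_THRESHOLD = 0.70
CRITICAL_THRESHOLD = 0.85


def classify_usage(used_bytes: int, total_bytes: int,
                   warning: float = WARNING_THRESHOLD,
                   critical: float = CRITICAL_THRESHOLD) -> str:
    """Return 'ok', 'warning', or 'critical' for the given disk usage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = used_bytes / total_bytes
    if fraction >= critical:
        return "critical"
    if fraction >= warning:
        return "warning"
    return "ok"


def minutes_until_full(used_bytes: int, total_bytes: int,
                       growth_bytes_per_min: float) -> Optional[float]:
    """Estimate minutes until the drive is full, assuming a constant write rate.

    Returns None when usage is not growing.
    """
    if growth_bytes_per_min <= 0:
        return None
    return max(total_bytes - used_bytes, 0) / growth_bytes_per_min


def check_drive(path: str) -> str:
    """Classify the current usage of the volume that contains `path`."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.used, usage.total)
```

With illustrative numbers, a 100 GB drive filling at 1 GB per minute leaves about 30 minutes of lead time when the first alert fires at 70% used, compared with about 15 minutes when it fires at 85%.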