Convercent incident

Convercent platform Service Outage

Convercent experienced a major incident on January 6, 2024, affecting the US Trial, EU Trial, EU Production, and US Production components and lasting 27m. The incident has been resolved; the full update timeline is below.

Started: Jan 06, 2024, 05:22 PM UTC
Resolved: Jan 06, 2024, 05:50 PM UTC
Duration: 27m
Detected by Pingoru: Jan 06, 2024, 05:22 PM UTC

Affected components

US Trial, EU Trial, EU Production, US Production

Update timeline

investigating Jan 06, 2024, 05:22 PM UTC

We’ve identified an issue where users are experiencing login failures with the error "invalid user name password". The issue affects all users accessing the service through the EU and US Production and Trial environments. We’re investigating and will provide an update as soon as possible.

resolved Jan 06, 2024, 05:50 PM UTC

The issue has been resolved. If you are still experiencing issues, please contact our support teams quoting IM-368. A root cause analysis (RCA) has been initiated and will be provided.

postmortem Jan 29, 2024, 08:07 AM UTC

# Event Description

Between Saturday, 6 January and Monday, 8 January 2024, customers using the OneTrust Convercent platform received ‘Invalid username and password’ notifications when attempting to access their production and trial environments. The impact extended to new case creation: new cases could not be reported because submissions failed.

# Findings and Root Cause

Upon engagement, engineering teams identified the root cause as a configuration that was missed during a recent certificate update. Updating the configuration resolved the incident for customers that were not using SSO encryption. Customers that did have SSO encryption enabled were required to update their certificate metadata, and OneTrust support teams contacted them to assist with this process.

Three certificates require regular updates. The process relies on manual intervention, and one of the certificate configurations was overlooked during the update.

**How could this incident have been avoided?**

By automating the certificate update process, or by creating a checklist if the manual process is continued.

**How could we have detected the issue sooner?**

Proactive alerts are in place; however, they were not executed due to a configuration issue.

**Is there a contingency or plan to control future incidents of this kind?**

We are exploring the possibility of automating the process. Until then, a checklist has been created to assist engineers with the process.

**If related to a change, why was it not discovered in testing?**

Infrastructure certificate updates are not subjected to testing.

# Corrective Actions

* Update the missing configuration on the certificate
* Add a checklist to be used when rotating expired certificates
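
The postmortem describes three certificates that are updated by hand, with a checklist added so that no dependent configuration is overlooked. As a purely illustrative sketch, not OneTrust's actual tooling, such a checklist could be expressed as a small script. The `CertRecord` type, the `rotation_issues` function, the certificate names, and the 30-day warning window are all hypothetical assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CertRecord:
    """Hypothetical checklist entry for one certificate (illustrative only)."""
    name: str               # placeholder label, not a real certificate name
    not_after: datetime     # expiry timestamp of the certificate
    config_updated: bool    # was dependent configuration updated after the last rotation?


def rotation_issues(certs, now, warn_window=timedelta(days=30)):
    """Return a human-readable finding for every certificate that needs attention."""
    issues = []
    for cert in certs:
        if cert.not_after <= now:
            issues.append(f"{cert.name}: expired")
        elif cert.not_after - now <= warn_window:
            issues.append(f"{cert.name}: expires within {warn_window.days} days")
        # The step overlooked in this incident: configuration tied to the certificate.
        if not cert.config_updated:
            issues.append(f"{cert.name}: configuration not updated")
    return issues


if __name__ == "__main__":
    utc = timezone.utc
    now = datetime(2024, 1, 6, tzinfo=utc)
    # Placeholder data; the real certificates and dates are not given in the report.
    inventory = [
        CertRecord("cert-a", datetime(2025, 1, 1, tzinfo=utc), True),
        CertRecord("cert-b", datetime(2024, 1, 20, tzinfo=utc), True),
        CertRecord("cert-c", datetime(2025, 1, 1, tzinfo=utc), False),
    ]
    for finding in rotation_issues(inventory, now):
        print(finding)
```

Tracking the configuration step as its own flag, separate from the expiry date, reflects the failure described above: the certificate itself was updated, but a configuration that depended on it was not.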