Thycotic incident

Secret Server Cloud: EU - Intermittent failures with API calls and Launching Secrets

Minor, Resolved

Thycotic experienced a minor incident on May 11, 2026 affecting Secret Server Cloud, lasting 10h 48m. The incident has been resolved; the full update timeline is below.

Started
May 11, 2026, 03:28 PM UTC
Resolved
May 12, 2026, 02:17 AM UTC
Duration
10h 48m
Detected by Pingoru
May 11, 2026, 03:28 PM UTC

Affected components

Secret Server Cloud

Update timeline

  1. monitoring May 11, 2026, 03:28 PM UTC

    As of 13:37 UTC, the degraded performance affecting Secret Server Cloud in the EU region has been resolved. Customers who experienced failures launching secrets or intermittent API errors should no longer be impacted. Our team is conducting a root cause analysis. We will post a follow-up update once findings are available. We apologize for any disruption this caused.

    Investigating, May 11, 2026, 07:57 UTC: We are investigating reports of degraded performance affecting Secret Server Cloud in the EU region. Some users may be unable to launch secrets. We will provide an update as soon as more information is available.

  2. resolved May 12, 2026, 02:17 AM UTC

    As of 13:37 UTC on May 11, 2026, the intermittent failures launching secrets in the EU region have been resolved. Our preliminary investigation determined that the root cause was an outage impacting a cloud infrastructure service used by Secret Server Cloud. Normal service has been confirmed restored. We are continuing to work with our cloud provider to obtain full root cause details and identify preventative actions. We apologize for the impact to your experience and appreciate your patience while we investigated.

  3. postmortem May 20, 2026, 06:09 AM UTC

    **Incident Overview**

    On May 11, 2026, starting at 07:53 UTC, Secret Server Cloud customers in the EU region experienced intermittent failures when launching secrets, initiating proxied RDP/SSH sessions, and making API calls requiring distributed engine communication. The incident was traced to a degradation in the underlying cloud messaging infrastructure in the West Central Europe region. At 13:37 UTC, the degraded performance affecting our services was fully resolved and normal operations were restored.

    The impact was limited to SSC customers with Distributed Engines. Secret viewing, management, and Web UI availability remained unaffected.

    **Root Cause**

    A degradation in the cloud messaging infrastructure in the West Central Europe region caused message subscription management operations to return HTTP 504 Gateway Timeout errors, preventing Distributed Engines from completing initialization and taking them offline. This resulted in timeouts across all distributed engine-routed operations, most visibly secret launches and proxied session initiations.

    The failure was isolated to the control plane layer of the messaging infrastructure. TCP-level connectivity remained healthy throughout the incident, and the issue was not attributed to any network or configuration change on our side. The issue was mitigated by our cloud provider rolling back a recent release on the messaging infrastructure that had contributed to the control plane failures.

    **Preventive Actions**

    * Expand monitoring coverage for cloud messaging exception rates and Distributed Engine subscription failure patterns to enable proactive detection ahead of customer impact.
    * Review integration of cloud provider health notifications into our on-call alerting pipeline to improve visibility into infrastructure events affecting Secret Server Cloud regions.
    * Assess improvements to Distributed Engine startup and reconnection logic to introduce retry handling with exponential back-off on transient messaging failures, reducing the risk of short-lived disruptions escalating into sustained engine outages.

    **Lessons Learned**

    The duration of customer impact during this incident was extended by gaps in our operational response. Specifically:

    * Limited visibility into cloud provider health events delayed our awareness of the underlying infrastructure degradation, and we did not follow our standard operating procedure to escalate with our vendor in a timely manner.
    * Acknowledgment of the incident on our status page was delayed, deviating from our standard incident communication process.
    * This incident reinforced the importance of continual improvements in both our monitoring and situational awareness of our infrastructure, as well as in our engineer training and development.

    We apologize for the extended impact our handling of this incident had on our customers and on their operations. We continue to take our responsibilities to our customers seriously, and have taken lessons from the handling of this incident to strengthen our processes going forward.
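
The Preventive Actions item on retry handling with exponential back-off can be illustrated with a minimal sketch. It is not the vendor's implementation: the names `TransientMessagingError`, `backoff_delays`, and `connect_with_backoff`, as well as the default delays and attempt count, are hypothetical and chosen only for illustration. The sketch assumes a transient failure (such as the HTTP 504 responses described under Root Cause) surfaces as a retryable exception, and it adds random jitter so that many engines do not retry at the same instant.

```python
import random
import time


class TransientMessagingError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP 504."""


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6):
    """Return the un-jittered delay ceiling (seconds) for each attempt."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


def connect_with_backoff(connect, base=1.0, factor=2.0, cap=60.0,
                         attempts=6, sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds or the attempts are exhausted.

    Each wait is a random fraction of the capped exponential ceiling
    ("full jitter"), so the waits grow on average but are not synchronized.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(base, factor, cap, attempts)
    for attempt, ceiling in enumerate(delays, start=1):
        try:
            return connect()
        except TransientMessagingError as exc:
            if attempt == len(delays):
                # Out of retries: surface a clear error, keeping the cause.
                raise RuntimeError("messaging service still unavailable") from exc
            sleep(rng() * ceiling)
```

A caller would pass its own connection routine, for example `connect_with_backoff(open_subscription)`, where `open_subscription` is likewise a hypothetical function that raises `TransientMessagingError` on a timeout. Capping the delay keeps recovery reasonably prompt once the upstream service returns, while the bounded attempt count lets a supervising process decide what to do after a sustained outage.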