Gainsight incident
Gainsight CS - EU - Elevated errors in NXT Authentication
Gainsight experienced a major incident on May 28, 2024 affecting the Gainsight CS EU Application, lasting 2h 23m. The incident has been resolved; the full update timeline is below.
Affected components
- Gainsight CS EU Application
Update timeline
- investigating May 28, 2024, 06:46 AM UTC
We are investigating login errors in the Gainsight NXT application. We will post updates as soon as they are available.
- investigating May 28, 2024, 07:31 AM UTC
The Gainsight NXT application is still down. We are working with the upstream service provider and will post updates as soon as they are available.
- identified May 28, 2024, 08:08 AM UTC
The issue has been identified and a fix is being implemented.
- monitoring May 28, 2024, 08:45 AM UTC
The fix has been implemented and all services are back to normal. The queues have been released, and pending jobs will catch up over the next couple of hours. We are monitoring closely.
- monitoring May 28, 2024, 08:45 AM UTC
We are continuing to monitor for any further issues.
- resolved May 28, 2024, 09:10 AM UTC
This incident has been resolved.
- postmortem Jun 03, 2024, 07:14 AM UTC
**Incident Summary for issue on 28 May 2024 (External)**

**Gainsight CS - EU - Elevated errors in NXT Authentication**

On **2024-05-28**, between **07:31 and 08:45 UTC**, users of the Gainsight application in the CS EU Cloud experienced intermittent application availability issues. The Gainsight UI was inaccessible for approximately 75 minutes during this window.

**Root Cause:**

Investigation identified the following cause of the incident:

* An infrastructure component, the backend worker node provisioner (Kubernetes Karpenter), was upgraded to a newer version to apply critical security patches and other updates.
* The same change had already been executed successfully in the STAGE and other PROD environments.
* During the EU environment upgrade, all metadata configurations were transferred except for one critical rule.
* The missing rule allowed UDP communication to the DNS servers.
* Because this rule was absent, DNS requests could not be resolved, causing microservices on newly provisioned worker nodes to fail. Microservices on older worker nodes were unaffected.
* These failures produced a large number of stale threads/connections in a short time frame, rendering the API Gateway unresponsive.
* Adding the missing rule to the Network Security Group and reprovisioning the worker nodes resolved the issue.
* Pending rule jobs were either skipped or resubmitted as necessary.

**Recovery Actions:**

1. Updated the missing UDP rule in the Network Security Group.
2. Restarted all affected services.

**Preventive Measures:**

1. Verify network rule consistency before and after any upgrade – this process has been initiated.
2. Schedule critical security updates, and even low-risk infrastructure changes, during non-peak hours, regardless of previous successes in other environments, to minimize impact.
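For illustration, restoring the missing UDP/DNS rule might look like the minimal sketch below, assuming the Network Security Group is an AWS EC2 security group (the report does not name the cloud provider; the region, group ID, and DNS resolver address here are all hypothetical placeholders, not Gainsight's actual configuration):

```python
# Minimal sketch: restore a missing egress rule that permits UDP DNS
# traffic from the worker nodes. Assumes an AWS EC2 security group;
# the region, group ID, and resolver CIDR are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # hypothetical EU region

WORKER_NODE_SG = "sg-0123456789abcdef0"  # hypothetical security group ID
DNS_RESOLVER_CIDR = "10.0.0.2/32"        # hypothetical VPC DNS resolver

ec2.authorize_security_group_egress(
    GroupId=WORKER_NODE_SG,
    IpPermissions=[
        {
            "IpProtocol": "udp",
            "FromPort": 53,
            "ToPort": 53,
            "IpRanges": [
                {"CidrIp": DNS_RESOLVER_CIDR, "Description": "DNS over UDP"}
            ],
        }
    ],
)
```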
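The first preventive measure, checking rule consistency before and after an upgrade, could be automated along these lines. Again this is a sketch under the same AWS assumption, with a hypothetical group ID; it is not Gainsight's actual tooling:

```python
# Sketch of a "network rule consistency" check: snapshot a security
# group's rules before an upgrade and verify none were lost afterwards.
# A real check would cover every group touched by the upgrade.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

def rule_snapshot(group_id: str) -> set[str]:
    """Return a canonical, comparable representation of a group's rules."""
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    rules = set()
    for direction in ("IpPermissions", "IpPermissionsEgress"):
        for perm in group.get(direction, []):
            for ip_range in perm.get("IpRanges", []):
                rules.add(
                    f"{direction}:{perm.get('IpProtocol')}:"
                    f"{perm.get('FromPort')}-{perm.get('ToPort')}:"
                    f"{ip_range['CidrIp']}"
                )
    return rules

before = rule_snapshot("sg-0123456789abcdef0")  # taken before the upgrade
# ... perform the infrastructure upgrade ...
after = rule_snapshot("sg-0123456789abcdef0")   # taken after the upgrade

missing = before - after
if missing:
    raise RuntimeError(f"Rules lost during upgrade: {sorted(missing)}")
```

A diff like this, run as a gate in the upgrade pipeline, would have flagged the dropped UDP/DNS rule before any worker nodes were reprovisioned.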