Gainsight incident

Gainsight CS - EU - Elevated errors in NXT Authentication

Major · Resolved

Gainsight experienced a major incident on May 28, 2024 affecting the Gainsight CS EU Application and lasting 2h 23m. The incident has been resolved; the full update timeline is below.

Started
May 28, 2024, 06:46 AM UTC
Resolved
May 28, 2024, 09:10 AM UTC
Duration
2h 23m
Detected by Pingoru
May 28, 2024, 06:46 AM UTC

Affected components

Gainsight CS EU Application

Update timeline

  1. investigating May 28, 2024, 06:46 AM UTC

    We are investigating errors while logging into the Gainsight NXT application. We will post updates as soon as they are available.

  2. investigating May 28, 2024, 07:31 AM UTC

    The Gainsight NXT application is still down. We are working with the upstream service provider. We will post updates as soon as they are available.

  3. identified May 28, 2024, 08:08 AM UTC

    The issue has been identified and a fix is being implemented.

  4. monitoring May 28, 2024, 08:45 AM UTC

    The fix has been implemented and all services are back to normal. The queues have also been released, and jobs will catch up over the next couple of hours. We are monitoring closely.

  5. monitoring May 28, 2024, 08:45 AM UTC

    We are continuing to monitor for any further issues.

  6. resolved May 28, 2024, 09:10 AM UTC

    This incident has been resolved.

  7. postmortem Jun 03, 2024, 07:14 AM UTC

    **Incident Summary for issue on 28 May 2024 (External)**

    **Gainsight CS - EU - Elevated errors in NXT Authentication**

    On **2024-05-28**, between **07:31 and 08:45 UTC**, users of the Gainsight Application in the CS EU Cloud experienced intermittent application availability issues. The Gainsight UI was inaccessible for approximately 75 minutes during this window.

    **Root Cause:**

    Investigation identified the following cause of the incident:

    * An infrastructure component, specifically the backend worker service (Kubernetes Karpenter), was upgraded to a newer version to patch critical security and other updates.
    * This change had already been executed successfully in the STAGE and other PROD environments.
    * During the EU environment upgrade, all metadata configurations were transferred except for one critical rule.
    * The missing rule allowed UDP communication to the DNS servers.
    * In its absence, DNS requests could not be resolved, causing microservices on newly provisioned worker nodes to fail. Microservices on older worker nodes were unaffected.
    * These failures produced a significant number of stale threads/connections in a short time frame, rendering the API Gateway unresponsive.
    * Adding the missing rule to the Network Security Group and reprovisioning the worker nodes resolved the issue.
    * Pending jobs were either skipped or resubmitted as necessary.

    **Recovery Action:**

    1. Updated the missing UDP rule in the Network Security Group.
    2. Restarted all affected services.

    **Preventive Measures:**

    1. Ensure network-rule consistency before and after any upgrade; this process has been initiated.
    2. Schedule critical security updates, and even low-risk infrastructure changes, during non-peak hours, despite previous successes in other environments, to minimize impact.
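
As an illustration of the failure mode the postmortem describes (this is a generic sketch, not Gainsight's tooling, and the probed hostname is hypothetical), a minimal check like the following can distinguish a DNS-resolution failure from other connectivity problems on a worker node:

```python
import socket

def dns_resolves(hostname: str, timeout: float = 3.0) -> bool:
    """Return True if `hostname` resolves to at least one address.

    A socket.gaierror here matches the symptom in the postmortem:
    with UDP traffic to the DNS servers blocked, name resolution
    fails on newly provisioned worker nodes even though the nodes
    themselves are otherwise healthy.
    """
    socket.setdefaulttimeout(timeout)
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

# Example: probe a service endpoint (hypothetical hostname).
# dns_resolves("api-gateway.internal")
```

Running such a probe on both old and new worker nodes would have localized the fault quickly, since only the newly provisioned nodes were affected.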