Linode incident

Service Issue: RTX 4000 Ada GPU Errors Across Multiple Regions

Severity: Minor
Status: Resolved
Started: Mar 05, 2026, 02:07 AM UTC
Resolved: Mar 05, 2026, 06:11 PM UTC
Duration: 16h 4m
Detected by Pingoru: Mar 05, 2026, 02:07 AM UTC

Affected components

  • US-ORD (Chicago): Linode Kubernetes Engine
  • US-SEA (Seattle): Linode Kubernetes Engine
  • JP-OSA (Osaka): Linode Kubernetes Engine
  • SG-SIN-2 (Singapore 2): Linode Kubernetes Engine

Update timeline

  1. investigating Mar 05, 2026, 02:07 AM UTC

    We are investigating a critical service issue affecting NVIDIA RTX 4000 Ada GPU nodes across multiple regions, including Osaka (osa1), Seattle (sea1), and Chicago (ord1). Affected GPU nodes may report an unrecoverable error state, leading to failures in Vulkan initialization and GPU-accelerated workloads. Additionally, some LKE clusters in the Osaka region are currently experiencing Control Plane connectivity issues, resulting in timed-out API requests and errors. Our engineering teams are investigating the root cause, focusing on a potential regression in the underlying host hypervisor or GPU firmware. We will provide more information as it becomes available.

  2. investigating Mar 05, 2026, 05:48 AM UTC

    Our subject matter experts are actively investigating the issue. We will provide the next update as progress is made.

  3. investigating Mar 05, 2026, 06:55 AM UTC

    We are continuing to investigate the issue. We will provide the next update as progress is made.

  4. monitoring Mar 05, 2026, 07:34 AM UTC

    Our team has identified the issue affecting the service and implemented a fix. We will be monitoring this to ensure that it remains stable. If you continue to experience problems, please open a Support ticket for assistance.

  5. investigating Mar 05, 2026, 10:50 AM UTC

    We are aware of a recurrence of this issue across multiple regions. We are continuing to investigate and will provide the next update as progress is made.

  6. investigating Mar 05, 2026, 02:48 PM UTC

    We are continuing to investigate and will provide the next update as progress is made.

  7. identified Mar 05, 2026, 04:23 PM UTC

    Our team has identified the issue affecting the service. We are working quickly to implement a fix, and we will provide an update as soon as the solution is in place.

  8. monitoring Mar 05, 2026, 05:01 PM UTC

    At this time we have been able to correct the issues affecting the service. We will be monitoring this to ensure that it remains stable. If you continue to experience problems, please open a Support ticket for assistance.

  9. resolved Mar 05, 2026, 06:11 PM UTC

    We haven’t observed any additional issues with the service, and will now consider this incident resolved. If you continue to experience problems, please open a Support ticket for assistance.

  10. postmortem Mar 09, 2026, 01:32 AM UTC

    Starting at approximately 21:00 UTC on March 4, 2026, customers using NVIDIA RTX 4000 Ada GPU-backed Linodes began experiencing lockups. The issue was initially believed to be isolated to worker nodes on the Linode Kubernetes Engine (LKE) platform, but was later confirmed to affect all Linodes using this hardware. Standard Compute and non-RTX 4000 GPU instances were unaffected.

    After ruling out recent software releases, our subject matter experts isolated the root cause to a recently deployed telemetry script. During a routine system improvement initiative, our teams had repaired a broken legacy monitoring script in order to restore a missing metric on our internal observability dashboards. The repaired script, originally written for an earlier GPU generation, issued a firmware inspection query whose side effects were not apparent from the scope of the fix. On the RTX 4000 Ada architecture, this class of query against an active GPU triggers a race condition in the GPU System Processor (GSP), causing the GPU to enter a protective lockup state and become unavailable to running workloads.

    We disabled the monitoring script across the GPU fleet and rebooted the affected nodes to mitigate the impact. The issue was fully mitigated around 17:16 UTC on March 5, 2026. We sincerely apologize for the disruption this caused to your GPU-accelerated applications and services, and we will take preventive measures to guard against recurrence. This summary reflects our current understanding of the incident given the information available; our investigation is ongoing, and the details herein are subject to change.
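The updates above describe GPUs reporting an unrecoverable error state. For operators diagnosing similar symptoms on their own nodes, a quick check is to query `nvidia-smi` and look for error markers in its output. The sketch below is illustrative and not part of Linode's tooling; the helper names are ours, it assumes `nvidia-smi` is on the PATH, and the exact error strings can vary by driver version.

```python
import subprocess

# Field values that suggest the driver can no longer talk to a GPU.
# (In CSV query mode nvidia-smi may report "[Unknown Error]" or similar
# when a GPU has fallen off the bus -- treat this list as an assumption.)
ERROR_MARKERS = ("[Unknown Error]", "[GPU is lost]", "ERR!")

def parse_gpu_status(csv_output: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=index,name,temperature.gpu
    --format=csv,noheader` output into per-GPU status records."""
    gpus = []
    for line in csv_output.strip().splitlines():
        index, name, temp = [field.strip() for field in line.split(",")]
        healthy = not any(marker in line for marker in ERROR_MARKERS)
        gpus.append({"index": index, "name": name,
                     "temperature": temp, "healthy": healthy})
    return gpus

def check_node_gpus() -> list[dict]:
    """Run nvidia-smi on the local node. A non-zero exit code or a hang
    past the timeout is itself a strong signal the GPU is locked up."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,temperature.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True, timeout=10)
    if result.returncode != 0:
        raise RuntimeError(f"nvidia-smi failed: {result.stderr.strip()}")
    return parse_gpu_status(result.stdout)
```

A probe like this can run as a Kubernetes liveness check on LKE worker nodes, so pods are rescheduled away from a node whose GPU has locked up.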
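The postmortem attributes the lockups to an invasive firmware query issued against an active GPU. A common defensive pattern for fleet telemetry is to tier probes, so that deep inspection queries are only issued when a device is idle and known-safe for that architecture. The sketch below shows that pattern in generic form; the tier names, the `ada` architecture check, and the idle test are all our assumptions, since the postmortem does not describe Linode's actual script.

```python
from dataclasses import dataclass

# Probe tiers: read-only counters are safe anywhere; deeper inspection
# queries (e.g. firmware-level reads) are gated. Names are illustrative.
SAFE_QUERIES = ("utilization", "temperature", "memory.used")
INVASIVE_QUERIES = ("firmware_inspection",)

@dataclass
class GpuState:
    index: int
    architecture: str      # e.g. "ada", "turing"
    busy: bool             # whether the GPU has active compute contexts

def plan_probe(gpu: GpuState) -> tuple[str, ...]:
    """Return the telemetry queries safe to issue against this GPU.

    Invasive queries are skipped on busy GPUs, and skipped entirely on
    architectures where they are known to race with the GSP."""
    queries = SAFE_QUERIES
    if not gpu.busy and gpu.architecture != "ada":
        queries = SAFE_QUERIES + INVASIVE_QUERIES
    return queries
```

The point of the gate is that adding a new query to a monitoring script is never a side-effect-free change: the plan step makes the safety decision explicit and testable, instead of leaving it implicit in whatever the script happens to call.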
