Kobiton incident

Private devices stay in utilizing state

Kobiton experienced a major incident on March 31, 2025 affecting Private Cloud Devices, lasting 16m. The incident has been resolved; the full update timeline is below.

Started: Mar 31, 2025, 06:54 PM UTC
Resolved: Mar 31, 2025, 07:11 PM UTC
Duration: 16m
Detected by Pingoru: Mar 31, 2025, 06:54 PM UTC

Affected components

Private Cloud Devices

Update timeline

investigating Mar 31, 2025, 06:54 PM UTC

We are currently investigating an issue with devices remaining in the "Utilizing" state after a session ends.

investigating Mar 31, 2025, 07:04 PM UTC

We are continuing to investigate this issue.

resolved Mar 31, 2025, 07:11 PM UTC

This incident has been resolved. Root cause is forthcoming.

postmortem May 06, 2025, 08:18 PM UTC

**1: What happened**

A single `device-connector` pod in the US Prod cluster reached **100% CPU saturation** and stopped servicing gRPC traffic. Because the pod’s existing _gRPC_ health probe continued to return `OK`, Amazon EKS left the instance in the target group and the load balancer kept routing traffic to it. All requests that landed on that pod stalled, producing timeouts for a subset of customers.

**2: Impact**

* Device bookings, passcode processing, cleanups, and other device administrative processing tasks were delayed.
* Automation jobs that required an available device were blocked.
* Manual testing sessions already in progress were **not** interrupted.
* No data corruption occurred.

**3: Timeline (UTC)**

| Time | Event |
| --- | --- |
| 15:32 | First customer report of devices not returning to `Available` after end of session. |
| 17:09 | On-call engineer identifies one pod pegged at 100% CPU. |
| 17:52 | Evicted pod deleted. |
| 17:57 | Devices return online. |
| 18:12 | Devices launching successfully. Incident marked as `RESOLVED`. |
| 19:12 | 1-hour watch period passes with no recurrence. |

**4: Root cause**

A service that receives admin messages from individual device connect instances continued to respond to gRPC health probes, so EKS kept treating the pod as healthy. The pod, however, was unable to process new requests. Requests routed to it hung until they timed out after 30 seconds.

**5: Immediate resolution**

* Manually deleted the faulty pod; Kubernetes deployed a fresh replica.
* Confirmed consumer and producer offsets were fully caught up; no message loss.

**6: Follow-up / preventive actions**

| Action | Target release |
| --- | --- |
| Add an **HTTP health probe** to check pod health. | 4.19 (April 26, 2025) |
| **Thread analysis:** Profile CPU component to identify any issues with memory management. | Completed. None found. |
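The failure described above, a pod that answers its health probe while no longer doing useful work, is one that a work-aware HTTP health endpoint is meant to catch. The following is a minimal Python sketch of that idea, not Kobiton's implementation: the `Heartbeat` class, the `/healthz` path, port 8080, and the 30-second freshness window are all hypothetical choices made for illustration. The endpoint reports healthy only if the worker loop has recorded progress recently.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Optional, Tuple


class Heartbeat:
    """Tracks when the worker loop last made progress (hypothetical helper)."""

    def __init__(self, max_age_seconds: float) -> None:
        self._max_age = max_age_seconds
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self) -> None:
        # Called by the worker after each unit of work it completes.
        with self._lock:
            self._last_beat = time.monotonic()

    def is_fresh(self, now: Optional[float] = None) -> bool:
        # `now` is injectable so staleness can be tested without sleeping.
        current = time.monotonic() if now is None else now
        with self._lock:
            return (current - self._last_beat) <= self._max_age


def health_status(heartbeat: Heartbeat, now: Optional[float] = None) -> Tuple[int, bytes]:
    """Map heartbeat freshness to an HTTP status code and body."""
    if heartbeat.is_fresh(now):
        return 200, b"ok"
    return 503, b"stalled"


def make_handler(heartbeat: Heartbeat):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/healthz":  # assumed probe path
                self.send_error(404)
                return
            code, body = health_status(heartbeat)
            self.send_response(code)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args) -> None:
            pass  # keep probe traffic out of the logs

    return HealthHandler


def start_health_server(heartbeat: Heartbeat, host: str = "0.0.0.0",
                        port: int = 8080) -> ThreadingHTTPServer:
    """Serve the health endpoint on a background daemon thread."""
    server = ThreadingHTTPServer((host, port), make_handler(heartbeat))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In this sketch the worker would create `Heartbeat(max_age_seconds=30.0)`, call `start_health_server(...)` once at startup, and call `beat()` after each message it processes. A Kubernetes `httpGet` liveness or readiness probe pointed at the assumed path would then see a 503 once the worker stops making progress. Whether a check of this shape would have fired during this specific incident depends on details the postmortem does not state.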