Tonkean incident

Slowness in handling of Tonkean actions

Minor · Resolved

Tonkean experienced a minor incident on September 29, 2025 affecting Workflows Runtime, User Interfaces (Forms, Item Interfaces, Workspace Apps, Business Reports), and Workflow Runtime History, lasting 1h 28m. The incident has been resolved; the full update timeline is below.

Started
Sep 29, 2025, 06:53 PM UTC
Resolved
Sep 29, 2025, 08:21 PM UTC
Duration
1h 28m
Detected by Pingoru
Sep 29, 2025, 06:53 PM UTC

Affected components

Workflows Runtime
User Interfaces (Forms, Item Interfaces, Workspace Apps, Business Reports)
Workflow Runtime History

Update timeline

  1. investigating Sep 29, 2025, 06:53 PM UTC

    We are experiencing slowness in the execution of actions. We're looking into it.

  2. resolved Sep 29, 2025, 08:21 PM UTC

    This incident has been resolved.

  3. postmortem Sep 30, 2025, 04:38 PM UTC

    Root Cause Analysis – Infrastructure Upgrade Tagging Incident

    We completed an infrastructure upgrade on September 28, 2025, and switched traffic to new clusters. The system initially operated normally, but a required subnet tag used by our autoscaler was inadvertently removed during cleanup. The issue was not visible because existing nodes remained active until workloads triggered autoscaling. At that point, new nodes could not be provisioned, and some services gradually scaled down to zero replicas. Customers began to experience degraded performance, including delayed event handling and reduced responsiveness.

    The disruption began on September 29 at 15:00 UTC. Internal monitoring flagged degradation at 15:35 UTC, but on-call engineers were not immediately alerted. Customers reported issues shortly thereafter, prompting an investigation. At 18:05 UTC, the missing subnet tag was identified as the cause. Restoring it at 18:18 UTC immediately allowed node scheduling to resume, and by 18:26 UTC all services were back to normal. The total duration of degraded service was approximately three hours. No data loss occurred.

    The root cause was the removal of the shared subnet tag that the Kubernetes autoscaler (Karpenter) required. Without it, autoscaling failed once triggered. To mitigate, we restored the tag, validated node provisioning, and confirmed that services recovered fully.

    To prevent recurrence, we are adding validation steps before and after infrastructure changes, new alerts for unscheduled pods and node provisioning failures, and safeguards to ensure critical services cannot scale below safe thresholds. We are also updating our Infrastructure-as-Code and playbooks to better protect shared dependencies. We apologize for the disruption and are committed to ensuring the reliability and resilience our customers expect.
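
The prevention items in the postmortem (validation before and after infrastructure changes, alerts when node provisioning fails) can be approximated with a small automated check. The sketch below is illustrative only: it assumes Karpenter discovers subnets through the common karpenter.sh/discovery tag, and the cluster name, tag key, and minimum subnet count are assumptions rather than Tonkean's actual configuration.

```python
"""Hypothetical post-change validation: confirm that the subnets the
autoscaler discovers still carry the tag it selects on, so a missing tag
fails the change pipeline instead of surfacing later as failed provisioning."""
import boto3

CLUSTER_NAME = "prod-cluster"              # assumption: illustrative cluster name
DISCOVERY_TAG = "karpenter.sh/discovery"   # assumption: common Karpenter discovery tag key
EXPECTED_SUBNETS = 3                       # assumption: minimum subnets expected per cluster


def tagged_subnet_ids(ec2) -> list[str]:
    """Return IDs of subnets carrying the discovery tag for this cluster."""
    resp = ec2.describe_subnets(
        Filters=[{"Name": f"tag:{DISCOVERY_TAG}", "Values": [CLUSTER_NAME]}]
    )
    return [s["SubnetId"] for s in resp["Subnets"]]


def main() -> None:
    ec2 = boto3.client("ec2")
    subnets = tagged_subnet_ids(ec2)
    if len(subnets) < EXPECTED_SUBNETS:
        # Fail loudly during the change window rather than waiting for
        # autoscaling to be triggered with no schedulable subnets.
        raise SystemExit(
            f"Only {len(subnets)} subnets tagged {DISCOVERY_TAG}={CLUSTER_NAME}; "
            f"expected at least {EXPECTED_SUBNETS}. Node provisioning may fail."
        )
    print(f"OK: {len(subnets)} tagged subnets found: {subnets}")


if __name__ == "__main__":
    main()
```

Run after any networking or cluster change (for example as a step in the upgrade playbook), a check like this would have flagged the removed tag before existing nodes drained and autoscaling was needed.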