Amity incident

Partially Degraded Performance [US region]

Major · Resolved

Amity experienced a major incident on January 21, 2026 affecting Core Services (US), lasting 1h 58m. The incident has been resolved; the full update timeline is below.

Started
Jan 21, 2026, 02:00 PM UTC
Resolved
Jan 21, 2026, 03:58 PM UTC
Duration
1h 58m
Detected by Pingoru
Jan 21, 2026, 02:00 PM UTC

Affected components

Core Services (US)

Update timeline

  1. investigating Jan 21, 2026, 03:35 PM UTC

    We're investigating an alert on the performance of our social and realtime services. You may experience delays in social and realtime connections or response times.

  2. monitoring Jan 21, 2026, 03:37 PM UTC

    A fix has been implemented and we are monitoring the results.

  3. resolved Jan 21, 2026, 03:58 PM UTC

    This incident has been resolved.

  4. postmortem Jan 22, 2026, 02:32 PM UTC

    **Incident Date:** 2026-01-21
    **Impact:** System degradation and intermittent downtime.
    **Primary Cause:** Infrastructure **resource exhaustion** triggered by an unprecedented high-volume traffic surge.

    ## 1. Summary

    On January 21, an unprecedented surge in traffic drove demand to a peak of **450,000 requests per minute (~56.25x baseline)**. While the application servers autoscaled successfully, the Core Database became the bottleneck. Despite two manual vertical scaling interventions, the system experienced two periods of degradation before stabilizing once database capacity finally matched demand.

    ## 2. Root Cause

    The root cause of the incident was **infrastructure resource exhaustion** resulting from insufficient database overhead to accommodate a sudden traffic spike.

    * **Traffic Volume:** An unprecedented surge in external demand drove platform traffic significantly beyond predicted growth, from a baseline of **8,000 req/min** to a peak of **450,000 req/min (a ~56.25x increase)**.
    * **Scaling Operation Time:** Vertically scaling the Core Database required a **10–30 minute operation** per event. During these intervals, the system remained degraded because incoming demand outpaced both available capacity and recovery speed.

    ## 3. Optimizations & Corrective Actions

    Based on the investigation, we will implement the following technical safeguards:

    #### **A. Transition Impacted Queries to Secondary Nodes**

    * **Action:** Reconfigure the remaining database queries to target Secondary (Read) Replicas rather than the Primary node.
    * **Goal:** Offload significant pressure from the Primary database. Reducing the load on the Primary node ensures it retains enough resource overhead to improve scaling and recovery time, preventing the Primary from being choked by contention and allowing it to complete vertical scaling operations much faster during a surge.

    #### **B. Optimize Autoscaling Performance (Server & Database)**

    * **Action:** Review and tune autoscaling policies for both the App Tier and the Database Tier to specifically reduce operation time.
    * **Goal:** Decrease the "Time-to-Ready" for new resources. By optimizing scaling triggers and resource warm-up procedures, we ensure capacity is provisioned more rapidly, improving the system's overall recovery time during a sudden spike.
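To illustrate corrective action A, here is a minimal sketch of read/write query splitting. The names (`QueryRouter`, the connection labels) are hypothetical and not Amity's actual code; the point is only that read-only statements are spread across secondary replicas while writes stay on the primary, preserving the primary's headroom:

```python
import itertools

# Statements starting with these verbs are treated as read-only.
READ_VERBS = ("SELECT", "SHOW", "EXPLAIN")

class QueryRouter:
    """Hypothetical router: reads go to replicas, writes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)  # round-robin over replicas

    def route(self, sql):
        """Return the connection a statement should run on."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in READ_VERBS:
            return next(self.replicas)  # offload reads to a secondary node
        return self.primary             # writes must stay on the primary

router = QueryRouter("primary-db", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM feeds"))       # -> replica-1
print(router.route("select id FROM messages"))   # -> replica-2
print(router.route("UPDATE posts SET likes = 1"))  # -> primary-db
```

In production this split is usually done by the database driver or a proxy layer rather than hand-rolled, but the routing rule is the same.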
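The interaction between the surge and the 10–30 minute scaling time can be shown with a back-of-envelope model. The baseline (8,000 req/min) and peak (450,000 req/min) come from the postmortem; the capacity steps and timings below are illustrative assumptions, not measured values:

```python
def degraded_minutes(demand, capacity_steps):
    """Count minutes where demand exceeds the capacity currently in effect.

    demand:         list of req/min, one entry per minute
    capacity_steps: list of (start_minute, capacity_req_per_min); each step
                    takes effect only once its scaling operation completes
    """
    degraded = 0
    for minute, load in enumerate(demand):
        # Capacity in effect = largest step whose start time has passed.
        cap = max(c for start, c in capacity_steps if start <= minute)
        if load > cap:
            degraded += 1
    return degraded

# Assumed surge shape: 5 min at baseline, then a jump to the observed peak.
demand = [8_000] * 5 + [450_000] * 55

# Assumed capacities: initial headroom, then two manual vertical-scaling
# interventions that each take ~20 minutes to complete.
steps = [(0, 60_000), (25, 200_000), (45, 500_000)]

print(degraded_minutes(demand, steps))  # -> 40 minutes degraded
```

Even with prompt interventions, every minute a scaling operation is in flight is a minute the database runs over capacity, which is why actions A (shed read load) and B (shorten time-to-ready) both target the same degradation window.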