Reown incident

Widespread Outage

Critical Resolved

Reown experienced a critical incident on September 16, 2023. The incident has been resolved; the full update timeline is below.

Started
Sep 16, 2023, 11:49 AM UTC
Resolved
Sep 16, 2023, 04:00 AM UTC
Duration
Detected by Pingoru
Sep 16, 2023, 11:49 AM UTC

Update timeline

  1. resolved Sep 16, 2023, 11:49 AM UTC

    An API we leverage for API key management was down and response times surged. This affected many upstream services consuming this API. We will share a postmortem soon.

  2. postmortem Sep 19, 2023, 01:02 PM UTC

    **TL;DR** Due to a bug in our caching policy, we hit CPU limits on Supabase Cloud DB prod, taking down an authentication database and causing partial outages for all customers.

    **Events**

    * The DDoS attack lasted ~45 minutes, starting at 04:45 AM CET and ending at 05:30 AM CET.
    * Relay kept Supabase busy until 07:45 AM CET.
    * Total downtime was 3 hours.
    * Ivan found the issue; Ilja and Tom jumped in and asked Chad to increase our CPU limits.
    * Impact
        * Unhealthy relay
        * Cloud app down
        * Web3Modal requests timing out

    **Root Cause**

    A DDoS attack in which the load was evenly distributed across VPS providers, focused on a single route, `/w3m/v1/getMobileListings`, and circumvented our cache policy by appending new query params. The Explorer API was hit ~21M times in 15 minutes.

    1. More context below 👇
        * Same `projectId` used across the IPs involved in the DDoS
        * IPs flagged as threat/proxy/anonymizer by Cloudflare
        * Coordinated at the same time across different servers and regions
        * Focused on a single route
    2. Supabase Cloud DB got overwhelmed and hit CPU limits.
    3. Relay was stuck in a retry loop because queries were timing out, which kept Supabase Cloud DB CPU usage at its limit.

    **What could we have done better?**

    1. The escalation path wasn’t clear enough (e.g., Ilja didn’t know how to escalate or page other folks)
        * Override to a team member when the on-call person is unavailable (e.g., on a flight)
        * Non-rota individuals should have enough OpsGenie access to trigger the “Escalate to All” button in case of a repeat scenario
    * Rate-limiting on Explorer API

    ### **Action items**

    **Short term**

    * [x] [Cali] Publish COE (depends on the summary being more up to date)
    * [x] [Cali] Query param validation
    * [ ] [Derek] Write a guide on how to page on-call
    * [ ] [Xav to find owner] 2nd Layer Cache for Cerberus
    * [ ] [Cali] Configure query timeout on Supabase client if possible

    **Mid-/long-term**

    * [ ] Stricter rate limiting
    * [ ] Scale down Supabase (schedule with Postgres migration)
    * [ ] Create a read replica (look into [https://supabase.com/docs/guides/database/replication](https://supabase.com/docs/guides/database/replication))
    * [ ] Red-teaming our externally-facing services
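
The cache bypass described in the root cause (new query params appended to `/w3m/v1/getMobileListings`) and the "Query param validation" action item can be illustrated with a minimal sketch. It is not Reown's implementation: the function name `cache_key`, the `ALLOWED_PARAMS` table, and the parameter names `page` and `entries` are hypothetical, and treating `projectId` as a query parameter is an assumption. The idea is to derive the cache key only from allowlisted parameters, so that an unknown parameter cannot produce a fresh cache miss.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical allowlist: the parameters the real route accepts are not
# listed in the postmortem, so these names are assumptions for illustration.
ALLOWED_PARAMS = {
    "/w3m/v1/getMobileListings": {"projectId", "page", "entries"},
}


def cache_key(url: str) -> str:
    """Build a cache key that ignores unknown query parameters."""
    parts = urlsplit(url)
    allowed = ALLOWED_PARAMS.get(parts.path, set())
    # Keep only allowlisted parameters, and sort them so that parameter
    # order does not create distinct keys for the same request.
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in allowed)
    return f"{parts.path}?{urlencode(kept)}"


base = "https://example.invalid/w3m/v1/getMobileListings"
plain = cache_key(f"{base}?projectId=abc&page=1")
busted = cache_key(f"{base}?page=1&projectId=abc&cb=12345")
print(plain == busted)  # the appended "cb" parameter no longer changes the key
```

With a key built this way, requests that differ only in junk parameters map to one cached entry and never reach the database. A stricter variant could reject requests carrying unknown parameters outright rather than silently dropping them.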