Clerk.io incident

Intermittent failures to se...

Clerk.io experienced a minor incident on September 11, 2025 affecting API (api.clerk.io), Services (Search), and Services (Recommendations), lasting 7h 4m. The incident has been resolved; the full update timeline is below.

Started: Sep 11, 2025, 07:03 AM UTC
Resolved: Sep 11, 2025, 02:07 PM UTC
Duration: 7h 4m
Detected by Pingoru: Sep 11, 2025, 07:03 AM UTC

Affected components

API (api.clerk.io)
Services (Search)
Services (Recommendations)

Update timeline

investigating Sep 11, 2025, 07:03 AM UTC

We are currently experiencing intermittent failures on search and recommendation requests. Engineering is working to restore regular service. More information and an RCA will follow.

resolved Sep 11, 2025, 02:07 PM UTC

Post Mortem - 2025-09-11 - Intermittent API failures

All times are CEST.

Overview

Earlier today, the logging, search, and recommendation API endpoints began failing intermittently and returning no results. We’re sorry for the disruption this caused to you and your customers.

Impact

Affected services: logging, search (v2), and recommendation (v2) APIs
Symptoms: intermittent error responses and missing results
Windows: 08:52–09:25 CEST and ~12:20 (brief)

Timeline

08:19 - A slow increase in incoming live updates (and CRUD requests) begins. The increase is masked by the rise in requests that happens every morning.
08:52 - Our inter-service messaging and event system reaches its memory limit, causing the first failed requests. As messages are consumed, capacity is freed up, which explains the intermittent nature of the issue.
09:02 - The issue is confirmed in monitoring and SRE begins investigating.
09:10 - Engineering confirms the issue in the messaging and event system and begins adding resources to increase event-processing capacity.
09:25 - The issue is resolved and intermittent failures cease.
12:20 - Another severe spike in incoming live updates causes a few more minutes of intermittent request failures.

Root Cause Analysis

The root cause was a lack of available resources in the messaging and event system. A transient increase in load from live updates caused it to hit a hardware limit and stop accepting new events. Parts of the API rely on this system for session, message, user-activity, and usage logging. With the system refusing new messages and events, these parts of the API failed.

Remediation

Several avenues of remediation are being pursued; most have already been implemented:

1. Completed: The parts of the API that rely on being able to hand over events to the messaging and event system have been modified to 'degrade gracefully'. That means that if our infrastructure has issues, you will still get search results and recommendations back from us; we disable usage tracking and other logging events instead.
2. Completed: Increased resources for the messaging and event system. We have increased the amount of resources in the system eight-fold (8x), allowing it to absorb much larger transient loads.
3. Planned for tonight: Separation of specific event types. We will move some event types to another, more resilient system, isolating them from large spikes in live updates and increasing their durability against hardware errors.

We sincerely apologize for the inconvenience this has caused, not just to you but also to your customers, and we want to reassure you that we are not done looking into ways to further strengthen our systems.
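
The 'degrade gracefully' behavior described under Remediation can be illustrated with a minimal Python sketch. It is not Clerk.io's implementation: `BoundedEventBus`, `EventBusUnavailable`, `run_search`, and `handle_search` are all hypothetical names, and the fixed-capacity bus is only an assumed stand-in for a messaging system that refuses new events once it hits a resource limit. The sketch shows a request handler that still returns results when the event hand-off fails, and skips tracking instead.

```python
from dataclasses import dataclass, field


class EventBusUnavailable(Exception):
    """Raised by the hypothetical event bus when it cannot accept new events."""


@dataclass
class BoundedEventBus:
    """Hypothetical stand-in for a messaging system with a fixed capacity."""
    capacity: int
    events: list = field(default_factory=list)

    def publish(self, event: dict) -> None:
        # Refuse new events once the (assumed) resource limit is reached.
        if len(self.events) >= self.capacity:
            raise EventBusUnavailable("event bus is at capacity")
        self.events.append(event)


def run_search(query: str) -> list:
    """Placeholder for the real search backend (assumption for illustration)."""
    return [f"result for {query!r}"]


def handle_search(query: str, bus: BoundedEventBus) -> dict:
    """Serve the search first; treat usage tracking as best-effort."""
    results = run_search(query)
    tracked = True
    try:
        bus.publish({"type": "search", "query": query})
    except EventBusUnavailable:
        # Degrade gracefully: drop the tracking event, keep the results.
        tracked = False
    return {"results": results, "tracked": tracked}
```

With a bus of capacity 1, the first call to `handle_search` is tracked and the second is not, yet both return results. The design choice this illustrates is that logging is treated as optional relative to the response, so a full event system costs tracking data rather than customer-facing results.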