Rippling experienced a critical incident on September 2, 2025, affecting Rippling App, lasting about one hour. The incident has been resolved; the full update timeline is below.
Affected components
- Rippling App
Update timeline
- investigating Sep 02, 2025, 05:06 PM UTC
We are currently investigating this issue.
- investigating Sep 02, 2025, 05:32 PM UTC
Single Sign-On to other apps is working, but access to the Rippling app (app.rippling.com) is still unavailable.
- identified Sep 02, 2025, 05:40 PM UTC
The issue has been identified, and access to the Rippling app is starting to recover.
- monitoring Sep 02, 2025, 05:47 PM UTC
Access to Rippling has been restored. We will continue to monitor.
- resolved Sep 02, 2025, 05:57 PM UTC
This incident has been resolved.
- postmortem Sep 05, 2025, 11:25 PM UTC
## Summary

_All times are in the US Pacific timezone._

On Tuesday, September 2, 2025, from 9:35am to 10:39am, Rippling experienced a major outage that left users unable to access the Rippling application. The outage impacted most users on the platform.

The outage was caused by two incompatible changes made in separate components of our system. Together, these changes triggered an infinite loop of database queries, overwhelming a critical database and dramatically increasing site latency. In an attempt to accelerate recovery, we enabled a database rate-limiting tool (meant to shed load from overloaded databases) which left the app unstable even after the original issue was resolved and prolonged the outage.

Rippling takes site reliability seriously, and we apologize for this incident. Below, we explain in detail how the issue arose, how it bypassed our safety procedures, and the steps we are taking to prevent similar issues from occurring.

## Background

The scope of this incident spanned three separate subsystems:

**Authentication**

Rippling’s suite of products is built on a common set of components, including an authentication (“auth”) framework. Since the auth framework runs at the beginning of every API request, slowdowns in auth affect all products. The performance of the auth framework is critical for user experience.

**Database ORM**

An Object-Relational Mapper (“ORM”) is a tool that acts as a translator between application code and database queries, which often simplifies database access. Like many complex applications, Rippling uses an ORM library to manage queries to our databases, which we further customize to centralize database access patterns.

**Feature flags**

Rippling uses a feature flag system to safely deliver changes to production. The feature flag system allows us to selectively enable product features for specific users and to quickly roll out (or roll back) changes within seconds. The feature flag system automatically falls back to a safe default on errors.

## What happened

On August 31, an engineer made a change to our database ORM, adding an audit event for a specific data access pattern that we plan to change. This audit event was controlled by the feature flag system, introducing a new dependency not previously used in the ORM. The change was viewed as low-risk by code reviewers, in part because the feature flag system is itself a risk mitigation tool. The change was validated and deployed without issue.

On September 2, a second engineer made a change to the feature flag system, designed to improve feature rollout targeting among users. This second change added a new query to the auth database (in a step called “build context”). The second engineer was not aware of the previous change or that the two changes would be incompatible. The change was validated by our automated tests, but upon further review we discovered an error in the corresponding test that made it ineffective.

When deployed, the new rollout targeting change invoked the previous ORM change. Because each component now invoked the other, the system entered an infinite loop and rapidly re-ran the new auth query, with the feature flag system’s automatic fallback masking the errors.
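To make the failure mode concrete, the sketch below reproduces it in miniature. The names (`FeatureFlags`, `AuditingORM`, the flag keys) are hypothetical, and the code is simplified far beyond our actual systems: the ORM’s audit hook checks a feature flag, the flag check queries the auth database through that same ORM, and the error fallback hides the resulting recursion while multiplying queries.

```python
# Hypothetical sketch of the circular dependency; illustrative names only,
# not Rippling's actual code.

auth_db_queries = 0  # counts simulated queries against the auth database


class FeatureFlags:
    def is_enabled(self, flag: str, user_id: str) -> bool:
        try:
            ctx = self._build_context(user_id)  # Sept 2 change: new auth query
            return flag in ctx["enabled_flags"]
        except Exception:
            return False  # automatic fallback to a safe default on any error

    def _build_context(self, user_id: str) -> dict:
        # The "build context" step fetches user attributes via the ORM,
        # which now calls back into the flag system.
        return orm.execute("auth_user lookup", user_id)


class AuditingORM:
    def execute(self, query: str, user_id: str) -> dict:
        global auth_db_queries
        auth_db_queries += 1
        # Aug 31 change: audit a specific access pattern, gated by a flag.
        # This call re-enters FeatureFlags and closes the loop.
        if flags.is_enabled("orm-access-audit", user_id):
            pass  # emit the audit event here
        return {"enabled_flags": set()}


flags = FeatureFlags()
orm = AuditingORM()

# A single flag check now recurses until Python's recursion limit is hit;
# the resulting RecursionError is swallowed by the fallback, so the call
# "succeeds" after issuing hundreds of simulated auth-database queries.
flags.is_enabled("new-rollout-targeting", "user-123")
print(f"auth DB queries for one flag check: {auth_db_queries}")
```

In production there is no convenient recursion limit to bound the damage: each request that evaluated a flag paid this amplified query cost against the auth database.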
We deploy all new versions of software at Rippling on a trial basis, known as our “canary”. The canary receives a small fraction of production traffic and automatically rolls back any new version of software with an elevated error _count_. This new version increased latency but decreased the number of requests served, suppressing the error count and passing the error-count test. Since our canary does not also test latency, the new version proceeded to a full deployment.
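The arithmetic below illustrates the gap with made-up numbers (not our real metrics): when a slow canary serves fewer requests, its absolute error count can drop even as its error rate and latency regress, so a count-based gate passes while rate- and latency-based gates would fail.

```python
# Illustrative numbers only: why an error-COUNT gate passed while the
# canary was actually degrading.

baseline = {"requests": 10_000, "errors": 50, "p95_ms": 120}
canary   = {"requests":  4_000, "errors": 40, "p95_ms": 900}  # slower => fewer requests served

# Error-count gate (what our canary tested): passes, because the slowdown
# suppressed throughput and with it the absolute number of errors.
count_ok = canary["errors"] <= baseline["errors"]            # 40 <= 50 -> True

# Error-rate gate: fails; errors per request actually doubled.
rate_ok = (canary["errors"] / canary["requests"]             # 1.0%
           <= baseline["errors"] / baseline["requests"])     # vs 0.5% -> False

# Latency gate: fails; p95 regressed far beyond a 20% tolerance.
latency_ok = canary["p95_ms"] <= 1.2 * baseline["p95_ms"]    # 900 > 144 -> False

print(f"count: {count_ok}, rate: {rate_ok}, latency: {latency_ok}")
# -> count: True, rate: False, latency: False
```

This is the gap that remediation 1 below addresses.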
Once fully deployed, the new software version generated more queries than the auth database could handle. Because the auth database is used widely across Rippling, latency increased quickly across all products, making the application effectively unusable.

Our Engineering team quickly identified the issue and began to roll back the faulty deploy. The rollback process shares configuration with our standard production deployment; because that process is optimized for safety, deployments (including rollbacks) are incremental and can be slow. Attempting to accelerate recovery, we enabled a database rate-limiting tool to shed load from the auth database. However, this change increased error rates and further destabilized the system. Engineers identified this secondary issue and disabled the rate limiting, allowing the app to fully recover.

## Critical infrastructure policy

Rippling has a policy to avoid deploying critical infrastructure changes during peak business hours (weekdays, 5am to 5pm), intended to prevent faulty deployments from negatively affecting customers. This issue bypassed that policy due to a misunderstanding around automated enforcement. We pursue automation whenever possible; most parts of our deployment infrastructure are automatic. The second engineer correctly recognized that their change affected critical infrastructure and believed our automated deployment process would schedule it accordingly. Unfortunately, the corresponding component was not marked as critical in our inventory, and the change was deployed outside our policy window.

## Timeline

| **Time** | **Description** |
| --- | --- |
| 8/31 - 11:32pm | The database ORM change is deployed. |
| 9/02 - 9:35am | The feature flag change is deployed, triggering the looping database queries. |
| 9/02 - 9:35am | Rippling application performance partially degrades, affecting users. |
| 9/02 - 9:40am | Rippling application performance significantly degrades, affecting users. |
| 9/02 - 9:41am | Monitoring system pages on-call engineers for an increase in HTTP 500 errors. |
| 9/02 - 9:48am | An incident is created. |
| 9/02 - 9:53am | The faulty deployment is identified and reverted. |
| 9/02 - 10:15am | Rate-limiting mechanism on the auth database is enabled. |
| 9/02 - 10:18am | The rollback deployment is completed and Rippling is running the last known good version. |
| 9/02 - 10:30am | An engineer disables the database rate limiter. |
| 9/02 - 10:39am | Rippling application performance returns to normal. |

## Fixes so far

We have made immediate changes to our tools and policies:

1. We have temporarily paused production infrastructure changes. (Future changes will pass through a more restrictive review process.)
2. We have removed both the ORM and feature flag changes, and they will not be re-introduced.
3. We have fixed the incorrect classification of the feature flag system in our component inventory; it is now marked as critical infrastructure.

## Remediations and follow-ups

Based on our learnings from this incident, we’re introducing the following changes to our infrastructure, process, and code:

1. We will expand our canary testing and automated rollbacks to include better error and latency detection.
2. We will speed up our rollback strategy to minimize delays when restoring service.
3. We will improve our database rate-limiting tool and strategy to exempt critical workflows.
4. We will add checks to detect and gracefully halt similar infinite loops in sensitive components (see the sketch at the end of this post).
5. We will audit our component inventory to ensure our deployment system classifies all remaining critical systems correctly.

## Conclusions

This incident lasted nearly an hour and had a wide impact on Rippling customers. Ultimately, our automations failed to adequately protect our production deployment. Again, we sincerely apologize for the disruption. We have begun the remediation steps above and will continue to address them urgently. As always, we will keep investing in our platform to further increase the reliability of our product.
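As an illustration of the loop-halting checks in remediation 4, here is one possible shape for such a guard. This is a hypothetical sketch, not our production implementation: thread-local state lets a feature-flag evaluation that re-enters itself through the ORM short-circuit to the safe default instead of looping.

```python
# Hypothetical re-entrancy guard for a feature-flag system; illustrative
# only, not Rippling's production code.

import threading

_local = threading.local()  # per-thread re-entrancy state


def guarded_flag_check(flag: str, user_id: str, inner_check) -> bool:
    """Evaluate a flag unless this thread is already inside a flag check;
    in that case, return the safe default immediately instead of looping."""
    if getattr(_local, "in_flag_check", False):
        # Re-entry detected: something downstream (e.g. an ORM audit hook)
        # asked for a flag while we were already evaluating one.
        return False  # same safe default as the error fallback
    _local.in_flag_check = True
    try:
        return inner_check(flag, user_id)
    finally:
        _local.in_flag_check = False


# Demo: an inner check that re-enters the guard (the shape of this
# incident's loop) terminates immediately with the safe default.
def looping_check(flag: str, user_id: str) -> bool:
    return guarded_flag_check(flag, user_id, looping_check)

print(guarded_flag_check("orm-access-audit", "user-123", looping_check))  # False
```

With a guard like this, the September 2 “build context” query would have run its ORM call with the guard set, and the ORM’s audit flag check would have returned the default immediately instead of recursing.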