Rippling incident

Issues accessing Rippling

Critical · Resolved

Rippling experienced a critical incident on July 21, 2025 affecting Rippling App, lasting 2h 18m. The incident has been resolved; the full update timeline is below.

Started
Jul 21, 2025, 04:08 PM UTC
Resolved
Jul 21, 2025, 06:26 PM UTC
Duration
2h 18m
Detected by Pingoru
Jul 21, 2025, 04:08 PM UTC

Affected components

Rippling App

Update timeline

  1. investigating Jul 21, 2025, 04:08 PM UTC

    We are currently investigating this issue.

  2. investigating Jul 21, 2025, 04:27 PM UTC

    We are continuing to investigate this issue.

  3. investigating Jul 21, 2025, 04:51 PM UTC

    We are continuing to investigate this issue.

  4. identified Jul 21, 2025, 05:02 PM UTC

    The issue has been identified and a fix is being implemented.

  5. identified Jul 21, 2025, 05:10 PM UTC

    We are continuing to work on a fix for this issue.

  6. identified Jul 21, 2025, 05:20 PM UTC

    We are continuing to work on a fix for this issue.

  7. monitoring Jul 21, 2025, 05:20 PM UTC

    A fix has been implemented and we are monitoring the results.

  8. investigating Jul 21, 2025, 05:36 PM UTC

    We are currently investigating this issue.

  9. investigating Jul 21, 2025, 05:59 PM UTC

    We are continuing to investigate this issue.

  10. resolved Jul 21, 2025, 06:26 PM UTC

    This incident has been resolved.

  11. postmortem Jul 25, 2025, 02:57 PM UTC

## Overview

Rippling experienced a major platform outage on Monday, Jul 21, 2025 from 9:00 AM to 11:22 AM PDT. The outage was caused by a system misconfiguration that led to a critical database becoming overloaded during a peak traffic window. The overloaded database resulted in slow load times, errors, and widespread inaccessibility of Rippling and integrated third-party apps for users.

The root cause was an increase in the load on one of our core databases during peak morning traffic. As this database stores data required by almost all Rippling pages, the increased load led to slow page loads and customer outages. In the operational incident that we created to mitigate this situation, we identified that the increased load was due to a misconfiguration in a system that connected to this database. Due to the specific nature of the misconfiguration, the database issue was not visible in our operational dashboards until we encountered the peak traffic load. We mitigated the issue by disabling the connection that increased the load on the database, which restored its performance and brought the site back up. Since then, we've taken immediate steps to improve monitoring, correct the misconfiguration, and put safeguards in place to prevent similar issues going forward.

## Background

Rippling operates a system (known as "CDC v1") that replicates data from one of our core transactional databases ("the database") to our analytical data store. Over the past 18 months, we developed a new version of this system (known as "CDC v2") as a more scalable, reliable, and performant alternative. Both CDC v1 and CDC v2 are intended to read from replica instances of the database. These read-only replicas automatically mirror data that is written to the primary read-write database instance. This configuration prevents the CDC systems from placing unacceptable load on the primary database and reduces the risk of their workloads degrading the primary database's availability or performance. It aligns with widely accepted industry best practices for operating production databases at scale.
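As a rough illustration of the intended topology described above, the sketch below shows a CDC-style reader being pointed only at a read-only replica endpoint, never the primary writer. This is not Rippling's code: the environment variable, DSN, helper names, and the use of Python/SQLAlchemy are assumptions made for the example.

```python
# Hypothetical sketch: a CDC-style batch reader routed to a read-only replica.
# Endpoint names and environment variables are invented for illustration.
import os
from sqlalchemy import create_engine, text

# Read-only replica endpoint; the primary writer's DSN is deliberately never
# referenced in this job, so it cannot be connected to by mistake.
REPLICA_DSN = os.environ.get(
    "CDC_REPLICA_DSN", "postgresql+psycopg2://cdc@db-replica.internal/prod"
)

def cdc_engine():
    """Engine for CDC reads, built only from the replica DSN so that heavy
    scans cannot consume the primary database's headroom."""
    return create_engine(REPLICA_DSN, pool_size=2, max_overflow=0)

def snapshot_table(table: str):
    """Read one table's rows for replication into the analytical store."""
    engine = cdc_engine()
    try:
        with engine.connect() as conn:
            return conn.execute(text(f"SELECT * FROM {table}")).fetchall()
    finally:
        engine.dispose()  # explicitly release connections when the job finishes
```

The misconfiguration described under Root cause below amounts to a job like this being wired to the primary database's DSN instead of the replica's.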
## Timeline

| **Timestamp** | **Event** | **Elapsed time (min)** |
| --- | --- | --- |
| July 18, 2025, 12:46 PM PDT | Additional datasets enabled on CDC v2 | N/A |
| July 21, 2025, 08:58 AM PDT | p50 overall request latency starts increasing above baseline | 0 |
| July 21, 2025, 09:00 AM PDT | p50 overall request latency degraded | 2 |
| July 21, 2025, 09:05 AM PDT | Incident created | 7 |
| July 21, 2025, 09:14 AM PDT | Identified first background workloads to safely disable | 16 |
| July 21, 2025, 10:17 AM PDT | CDC v2 workloads removed from primary database | 79 |
| July 21, 2025, 10:38 AM PDT | p50 overall request latency recovered and site degradation mitigated | 100 |

## Root cause

The incident was caused by a misconfigured data processing job that connected directly to the primary production database rather than to the intended read replicas. This workload was deployed on Friday, Jul 18, 2025 at 12:46 PM PDT, adding unexpected load to the primary database. The misconfiguration went undetected because it initially had no observable impact on the database and did not surface in the telemetry we monitor. An interaction between a bug in the CDC v2 system and new datasets that were enabled for replication on Friday amplified the load the CDC v2 system placed on the primary database.

As peak Monday morning traffic ramped up, the combined workload exceeded the headroom the primary database normally has available to serve peak workloads. This led to degraded platform performance and availability.

## Impact

Between 9:00 AM and 11:22 AM PDT on Monday, Jul 21, 2025, the Rippling platform experienced a major outage during which access to Rippling and some integrated third-party apps was intermittently to fully unavailable. Users encountered elevated page load times and page crashes, especially during peak usage at the top of the hour.

## Resolution

Upon detecting the database degradation at 9:08 AM PDT, the incident response team immediately declared a SEV-1 incident and began investigating the root cause. The team identified that background data processing jobs were placing excessive load on the primary database, and traced the cause to a misconfiguration that resulted in direct connections to the primary database instead of routing to read replicas as intended.

To mitigate the impact and restore stability, the following actions were taken:

* The problematic background jobs were disabled to reduce load on the primary database
* The misconfigured connections were explicitly closed in the processing jobs to prevent excessive and lingering open connections
* An audit of data processing jobs was performed to identify and pause any other jobs that might cause similar issues
* An audit of observability gaps was performed to enable faster detection of this kind of issue in the future
* Database load was continuously monitored and verified to return to normal levels, with full operational status restored by 11:22 AM PDT
* The state of the system was proactively monitored during peak demand windows for 24 hours after the incident was formally mitigated

Following the incident, we have implemented the following additional safeguards:

* Reduced the time connections to our primary database are kept open, to prevent long-lived idle connections from consuming headroom on the database
* Modified the CDC v2 data processing workloads so that they open fewer concurrent connections, thereby reducing the load they place on the database
* Updated the CDC v2 configuration to always connect to the database read replicas only
* Improved the observability of our primary database to allow us to more quickly isolate which workloads are placing a high load on it
* Migrated workloads that could be safely migrated from the primary database to replica instances

## Action Items

We are committing to the following immediate action items:

* Improve the observability of our databases so that we can more quickly identify situations where our primary database is operating without enough headroom
* Improve the observability of the CDC v2 system to ensure that the types of malfunctions that contributed to this incident are detected and alerted upon
* Implement solutions that allow us to quickly disable workloads to shed load when the database is in a degraded state (see the sketch after this list)
* Distribute background workloads more evenly over time so that they do not coincide with peak user demand cycles
* Distribute workloads between different primary databases to reduce single points of congestion
* Institute a zero-tolerance policy for direct connections to the primary database that are not mediated through a system capable of modulating and controlling their impact on the database
* Increase the overall system's headroom and validate it on an ongoing basis through continual load testing
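To make the load-shedding and mediated-access items above concrete, here is a minimal sketch, assuming a Python/SQLAlchemy stack: background jobs obtain connections only through a small gate that caps concurrency, bounds connection lifetime, and can be switched off by operators during a degraded database state. The module, names, and thresholds are illustrative, not Rippling's implementation.

```python
# Hypothetical sketch of a mediated database-access layer for background jobs:
# all jobs get connections from this gate, so operators can cap concurrency
# and shed load with a single switch. Names and thresholds are illustrative.
import os
import threading
from contextlib import contextmanager
from sqlalchemy import create_engine

_engine = create_engine(
    os.environ.get("CDC_REPLICA_DSN", "postgresql+psycopg2://cdc@db-replica.internal/prod"),
    pool_size=5,         # hard cap on concurrent connections per worker process
    max_overflow=0,      # never exceed the cap under burst load
    pool_recycle=300,    # retire connections after 5 minutes instead of holding them open
    pool_pre_ping=True,  # discard dead connections rather than surfacing errors
)

_shed_load = threading.Event()  # flipped by operators during a degraded database state

class LoadShedError(RuntimeError):
    """Raised when background work is paused to protect the database."""

@contextmanager
def background_connection():
    """The only sanctioned path for background jobs to reach the database."""
    if _shed_load.is_set():
        raise LoadShedError("background workloads are currently disabled")
    with _engine.connect() as conn:
        yield conn

# Operator controls; in practice these would be driven by a feature flag or runbook.
def disable_background_workloads():
    _shed_load.set()

def enable_background_workloads():
    _shed_load.clear()
```

A pattern along these lines also covers the earlier safeguards about fewer, shorter-lived connections, since every background workload inherits the same pool limits.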