Restaurant365 experienced an incident on March 17, 2025 affecting POS Integrations (DSS Polling), lasting 2h 56m. The incident has been resolved; the full update timeline is below.
Affected components
- POS Integrations (DSS Polling)
Update timeline
- identified Mar 17, 2025, 04:20 PM UTC
We are currently seeing degraded performance that is causing delays with DSS imports. We are investigating the root cause and will continue to provide updates.
- monitoring Mar 17, 2025, 04:29 PM UTC
We are now seeing recovery after implementing a fix. We will continue to monitor and provide additional updates.
- resolved Mar 17, 2025, 07:17 PM UTC
As we continue to see recovery, we now consider this incident resolved.
- postmortem Apr 01, 2025, 02:33 PM UTC
Root Cause Analysis (RCA)
Incident Title: Missing/Delayed DSS
Date: 3/14/2025 - 3/17/2025

1. Summary
› Incident Overview: Between March 14 and March 17, 2025, multiple related incidents disrupted the DSS (Daily Sales Summary) used by customers to view sales and labor data. On March 13, a new pipeline for the Comida Director was deployed and 3,700 locations were migrated, which triggered a series of issues. A combination of a code change, a later Customer Setting Service (CSS) outage, and an increase in the number of locations using this application caused the Director to become overwhelmed by the increased load. This resulted in a backlog of missing DSS records and inaccurate reporting.
› Timeline
  o February 2, 2025:
    - Comida Director upgrade began rolling out to various locations in a structured approach.
  o March 13, 2025:
    - 4:45 pm: Pipeline for the Comida Director deployed to production.
    - 7:00 pm: 3,700 new locations were migrated to the Comida Director.
  o March 14, 2025:
    - 4:05 am: Alerts triggered for missing DSS.
    - 4:30 am: Investigation commenced.
    - 6:30 am: Peak missing DSS count reached 1,930; the problematic deployment from 3/13 was reverted.
    - 6:45 am: Recovery steps initiated.
    - 9:15 am: A 911 call was started to identify a solution for improved recovery time.
    - 9:31 am: Manual intervention further accelerated recovery.
    - 10:05 am: Recovery reached acceptable levels; the incident call ended.
    - 11:30 am: Full recovery achieved.
  o March 16, 2025:
    - 2:17 am: Customer Setting Service (CSS) went down.
    - 2:27 am: CSS came back online.
    - 2:30 am: 1,097 missing DSS records were noted.
    - 2:42 am: CSS experienced another outage.
    - 3:10 am: A post in the 911 channel indicated CSS downtime and failing imports.
    - 3:17 am: CSS was restored.
    - 7:00 am: Peak missing DSS count hit 6,286; manual steps using a local version of poolx were initiated.
    - 7:15 am: Early recovery signs observed, with numbers slowly dropping.
    - 8:00 am: 911 channel updated that support cases were still coming in and imports were lagging.
    - 8:02 am: It was confirmed that DSS records were loading, albeit very slowly.
    - 11:24 am: Missing DSS count dropped to approximately 2,500.
    - 12:50 pm: Full recovery was reached.
  o March 17, 2025:
    - 5:15 am: Degradation began with an alert for 744 missing DSS records.
    - 7:30 am: Missing DSS count peaked.
    - 7:57 am: A 911 call was initiated regarding the issue.
    - 8:00 am: Manual intervention using poolx locally commenced.
    - 8:30 am: DevOps scaled up the minimum replicas, allowing more than one Comida Director instance to run (without effect).
    - 8:40 am: An index was applied to the PosSystemTenantLocationDss collection (without effect).
    - 9:15 am: DSS team continued investigation.
    - 9:36 am: Missing DSS count decreased to 894.
    - 10:05 am: Missing DSS count further dropped to 122.
    - 10:30 am: Full recovery was confirmed; concurrently, with Arch's assistance, the team identified the true root cause of all incidents over the last 4 days, which was related to the Director not being able to scale as expected. The Director was processing only 10 locations at a time.
    - 11:00 am: A three-point action plan was formulated:
      - Fix the code to increase parallelism from 10 to 100.
      - Revert the additional 3,700 locations migrated on March 13.
      - Develop an app/process to relieve pressure on the Director (using poolx with manifest).
    - 4:40 pm: The merge request (MR) to increase parallelism to 100 was merged.
    - 5:30 pm: The pipeline for the parallelism change was deployed to production.
    - 5:40 pm: Logs confirmed the positive effect of the change.
    - 6:15 pm: The Director had already created 1,000 jobs since the deployment.
    - 10:00 pm: The extra locations added on March 13 were rolled back.
› Immediate Impact: Customers experienced missing or delayed DSS imports, leading to inaccurate sales and labor data in reports. This disruption resulted in increased support cases and a degraded customer experience over multiple days.

2. Scope
› Affected Environments: All environments (CAN1, CAN2, Production, Post Production)
› Geographical Impact: N/A
› Service Impact:
  o Primary: Daily Sales Summary (DSS) – missing or delayed data imports.
  o Secondary: Comida Director – inability to process increased load, leading to cascading failures.
  o Tertiary: Customer Setting Service (CSS) – intermittent outages further compounded the issue.

3. Root Cause
› Cause Identification: The incident was triggered by a combination of a code change and the deployment of a new pipeline for the Comida Director on March 13, which migrated 3,700 locations. This, along with a subsequent CSS outage and an increased number of locations using the application, caused the Director to become overwhelmed.
› Technical Explanation: The root cause was determined to be the Comida Director's limited parallelism: it was processing only 10 locations at a time, which was insufficient to handle the increased load. This limitation, compounded by CSS outages and the additional load from the new locations, meant the Director could not keep up, producing a backlog of missing DSS records and inaccurate reporting.
› Contributing Factors:
  o Increased load from the migration of 3,700 locations.
  o CSS instability with multiple outages on March 16.
  o The initial code change and its subsequent reversion did not resolve the underlying scalability issue.

4. Impact Analysis
› Number of Cases:
  o March 14: 120 reported cases
  o March 16: 92 reported cases
  o March 17: 91 reported cases
› Severity of Impact: Severity 1

5. Corrective Actions
› Immediate Fixes:
  o March 14:
    - Reverted the code change deployed on March 13.
    - Executed manual interventions to process missing DSS records.
  o March 16:
    - Conducted manual recovery steps using a local poolx instance during CSS instability.
  o March 17:
    - Initiated early manual interventions and attempted scaling up Director replicas (which did not resolve the issue).
    - Applied an index to the PosSystemTenantLocationDss collection (without effect).
    - Increased parallelism from 10 to 100 and rolled back the extra locations migrated on March 13.
› Root Cause Resolution: A change was pushed across all environments to increase parallelism from 10 to 100, with a subsequent change increasing parallelism from 100 to 200 deployed on March 18 and an increase from 200 to 400 on March 20. Additionally, the extra locations added on March 13 were rolled back. As of March 26, parallelism has been increased to 800, with success.
› Communication: Real-time updates were communicated via the 911 channel, and the status page was used to update customers on March 14 and March 17.
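For illustration only: the RCA does not include the Director's source code, so the sketch below uses hypothetical names (`import_dss_for_location`, `process_locations`) and simulated timings to show how a fixed in-flight limit of 10 imports creates a backlog once thousands of locations are queued, and why raising that limit restores throughput.

```python
import asyncio
import random
import time

async def import_dss_for_location(location_id: str) -> None:
    """Stand-in for one location's DSS import (POS polling plus database writes)."""
    await asyncio.sleep(random.uniform(0.02, 0.08))  # simulated I/O latency; illustrative only

async def process_locations(location_ids: list[str], parallelism: int) -> float:
    """Import all locations with at most `parallelism` imports in flight; return elapsed seconds."""
    semaphore = asyncio.Semaphore(parallelism)

    async def bounded_import(location_id: str) -> None:
        async with semaphore:
            await import_dss_for_location(location_id)

    start = time.monotonic()
    await asyncio.gather(*(bounded_import(loc) for loc in location_ids))
    return time.monotonic() - start

async def main() -> None:
    # Hypothetical load on the order of the March 13 migration.
    locations = [f"loc-{i}" for i in range(3700)]
    for parallelism in (10, 100):
        elapsed = await process_locations(locations, parallelism)
        print(f"parallelism={parallelism}: {elapsed:.1f}s to import {len(locations)} locations")

if __name__ == "__main__":
    asyncio.run(main())
```

Under this simplified model, the time to clear a queue of locations scales inversely with the concurrency cap (and with the number of Director instances), which is consistent with the staged increases from 10 to 100, 200, 400, and ultimately 800 described above.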
6. Preventative Measures
› Process Improvements:
  o Implement rigorous load testing in a dedicated dev environment to simulate production load.
  o Establish a canary environment for the Comida Director to test code changes with a subset of customers before full production rollout.
› System Enhancements:
  o Develop an application/process to relieve load pressure on the Director using poolx with manifest.
  o Continuously adjust and optimize the parallelism settings, as needed.
› Monitoring and Alerts:
  o Continue to monitor ongoing throughput to proactively identify and adjust additional parallelism configuration, as needed.

Document Approval:
› Prepared by: Jeremy Eubanks, Associate Manager Information Security, 3/26/2025
› Reviewed by: Lauren Harden, Sr. Director Database, DevOps, Information Security, 3/26/2025
› Approved by: Jack Mossman, Senior Director, Integrations Technology, 3/28/2025

Distribution:
› Intended Audience: Engineering teams, DevOps teams, and affected customers.