Togetherwork incident

[SEV-1] Togetherpay - Production Outage

Critical · Resolved

Togetherwork experienced a critical incident on September 3, 2024, affecting Transaction Processing, Payment Tokenization, and Merchant Endpoints, lasting 7h 36m. The incident has been resolved; the full update timeline is below.

Started
Sep 03, 2024, 10:22 AM UTC
Resolved
Sep 03, 2024, 05:59 PM UTC
Duration
7h 36m
Detected by Pingoru
Sep 03, 2024, 10:22 AM UTC

Affected components

Transaction Processing, Payment Tokenization, Merchant Endpoints

Update timeline

  1. investigating Sep 03, 2024, 10:22 AM UTC

    Today's production roll did not go according to plan and processing in production is down. We are working as quickly as possible to resolve the issue. Additional information will be provided when known, or when the situation is resolved.

  2. identified Sep 03, 2024, 11:14 AM UTC

    Production is still down and teams are actively working to restore it. The issue has been identified and a fix is being implemented. We will provide another update when more is known or services are completely restored.

  3. identified Sep 03, 2024, 11:56 AM UTC

    Transactions are processing normally. Teams are still working to fully fix the issue. We will provide another update when more is known or the issue is fully resolved.

  4. resolved Sep 03, 2024, 05:59 PM UTC

    The Togetherpay production release was successfully rolled back. All systems are fully functioning as they were prior to this morning's roll. This incident is resolved.

  5. postmortem Sep 06, 2024, 07:16 PM UTC

    Togetherwork identified the root cause of the failed 9/3 production deployment. The primary causes were:

    1. GitLab downtime: the initial delay in the deployment was due to GitLab being down, which also caused subsequent slowness in the pipeline.
    2. Database migration issues: new column migrations were not applied correctly, leading to application errors and a failure to display merchants.
    3. Incomplete rollback: the rollback did not fully restore the previous state, causing further site downtime.

    Corrective actions being implemented include:

    1. Improved GitLab monitoring
    2. Database migration testing
    3. Improved rollback procedures
    4. Pipeline optimization

    The 9/3 incident was resolved by fully reverting to a previous, stable branch. Between 1:00 a.m. and 1:37 p.m. Eastern, Togetherwork Products may have experienced intermittent processing issues. Two windows were identified as complete payment processing outages:

    6:04 a.m. - 7:45 a.m. Eastern
    12:03 p.m. - 1:37 p.m. Eastern

    The re-deploy is scheduled for Wednesday, 9/11, between 1:00 a.m. and 4:00 a.m. Eastern.
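
The postmortem gives the two complete-outage windows in Eastern time, while the update timeline uses UTC. The sketch below converts the stated windows to UTC and totals their duration so they can be compared with the timeline. It assumes "eastern" means US Eastern Daylight Time (UTC-4, in effect in early September) and that both windows fall on September 3, 2024; the names `EDT`, `OUTAGE_WINDOWS`, and `to_utc_windows` are illustrative and do not come from the vendor.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "eastern" in the postmortem is US Eastern Daylight Time (UTC-4).
EDT = timezone(timedelta(hours=-4), "EDT")

# Complete-outage windows as stated in the postmortem, as (hour, minute) pairs
# in local Eastern time on Sep 3, 2024 (the date is an assumption).
OUTAGE_WINDOWS = [
    ((6, 4), (7, 45)),
    ((12, 3), (13, 37)),
]


def to_utc_windows(windows, year=2024, month=9, day=3):
    """Return a (start_utc, end_utc, duration) tuple for each window."""
    result = []
    for (start_h, start_m), (end_h, end_m) in windows:
        start = datetime(year, month, day, start_h, start_m, tzinfo=EDT)
        end = datetime(year, month, day, end_h, end_m, tzinfo=EDT)
        result.append(
            (start.astimezone(timezone.utc), end.astimezone(timezone.utc), end - start)
        )
    return result


windows_utc = to_utc_windows(OUTAGE_WINDOWS)
total_outage = sum((duration for _, _, duration in windows_utc), timedelta())

for start, end, duration in windows_utc:
    print(f"{start:%H:%M} - {end:%H:%M} UTC ({duration})")
print(f"Total complete outage: {total_outage}")
```

Under these assumptions the windows correspond to 10:04 - 11:45 UTC and 16:03 - 17:37 UTC, for a combined 3h 15m of complete payment processing outage.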