Togetherwork incident

[SEV-1] Togetherpay - Intermittent Processing Issues

Major Resolved

Togetherwork experienced a major incident on October 14, 2024, affecting Transaction Processing, Payment Tokenization, and 1 more component, lasting 40 minutes. The incident has been resolved; the full update timeline is below.

Started: Oct 14, 2024, 07:00 PM UTC
Resolved: Oct 14, 2024, 07:40 PM UTC
Duration: 40m
Detected by Pingoru: Oct 14, 2024, 07:00 PM UTC

Affected components

Transaction Processing, Payment Tokenization, Merchant Endpoints, ProPay Production Environment

Update timeline

  1. investigating Oct 14, 2024, 07:00 PM UTC

    ProPay is aware of an issue that is preventing some, possibly all, transactions from being processed and is actively working to resolve it. We will share additional information as soon as it is available.

  2. monitoring Oct 14, 2024, 07:08 PM UTC

    ProPay implemented a fix and is monitoring the results.

  3. monitoring Oct 14, 2024, 07:17 PM UTC

    We continue to see transaction timeouts. ProPay's fix does not appear to have slowed or fixed the issue. We continue to monitor the situation.

  4. monitoring Oct 14, 2024, 07:27 PM UTC

    ProPay continues to apply fixes and monitor results. We are seeing a significant improvement in processing. All teams are still monitoring the situation.

  5. resolved Oct 14, 2024, 07:40 PM UTC

    ProPay implemented its fix, monitored the results, and resolved the incident. We observed significant timeouts between 2:26 p.m. and 3:18 p.m. Eastern. Processing is back to normal. This incident is resolved.

  6. postmortem Oct 28, 2024, 07:43 PM UTC

    Regarding the incident on Monday, October 14, between 2:25 p.m. and 3:25 p.m. Eastern: ProPay identified several servers in an unhealthy state after maintenance was performed to address the increased size of the replication distribution database. This maintenance caused a disconnect between the application layer and the database layer, resulting in transaction timeouts. Resolution: Support teams performed a failover to the secondary server, which restored service. Additional review is being performed to improve monitoring, alerting, and response plans for any similar situations in the future.
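
For readers unfamiliar with the failover pattern mentioned in the postmortem, the following is a minimal sketch of the general idea, not a description of ProPay's systems. The postmortem describes a failover performed by support teams; this sketch automates the switch purely for illustration. Every name in it (`FailoverRouter`, `unhealthy_primary`, `healthy_secondary`) and the threshold of three consecutive timeouts are hypothetical assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverRouter:
    """Sends calls to the active backend and fails over after repeated timeouts.

    Hypothetical illustration only; it does not reflect any vendor's real design.
    """
    backends: List[Callable[[str], str]]  # ordered: primary first, then secondaries
    max_consecutive_timeouts: int = 3     # assumed threshold, chosen for the example
    active: int = 0                       # index of the backend currently in use
    _timeouts: int = field(default=0, init=False)

    def process(self, txn_id: str) -> str:
        try:
            result = self.backends[self.active](txn_id)
        except TimeoutError:
            self._timeouts += 1
            has_next = self.active + 1 < len(self.backends)
            if self._timeouts >= self.max_consecutive_timeouts and has_next:
                # Too many timeouts in a row: switch to the next backend.
                self.active += 1
                self._timeouts = 0
            raise  # the caller still sees this request fail
        self._timeouts = 0  # any success resets the counter
        return result


def unhealthy_primary(txn_id: str) -> str:
    # Stand-in for an application server that cannot reach its database.
    raise TimeoutError(f"primary timed out on {txn_id}")


def healthy_secondary(txn_id: str) -> str:
    # Stand-in for the secondary server that restored service.
    return f"{txn_id}: processed by secondary"
```

In this sketch the first three requests against the unhealthy primary still fail with `TimeoutError`; only later requests are routed to the secondary. A real deployment would also need health checks, alerting, and a way to fail back, which are outside the scope of this example.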