PlayFab incident

Action Processing Delay

PlayFab experienced a minor incident on July 31, 2025 affecting Event Processing, lasting 6h 38m. The incident has been resolved; the full update timeline is below.

Started: Jul 31, 2025, 10:17 PM UTC
Resolved: Aug 01, 2025, 04:55 AM UTC
Duration: 6h 38m
Detected by Pingoru: Jul 31, 2025, 10:17 PM UTC

Affected components

Event Processing

Update timeline

investigating Jul 31, 2025, 10:17 PM UTC

We are currently experiencing delayed processing of actions for rule and segment automation. Engineers are working to resolve this as soon as possible.

investigating Aug 01, 2025, 12:31 AM UTC

We are continuing to investigate and testing a potential mitigation to improve action processing throughput.

monitoring Aug 01, 2025, 02:01 AM UTC

A fix has been deployed and we are continuing to monitor as processing catches up.

resolved Aug 01, 2025, 04:55 AM UTC

This incident has been resolved.

postmortem Aug 12, 2025, 11:47 PM UTC

On July 31, 2025, between 10:30 AM and 7:29 PM PDT, some customers experienced significant delays when using PlayStream actions for rules and segments. Action executions were delayed, with a maximum delay of over 500 minutes at peak. The incident was caused by a combination of load and configuration values for the maximum number of records each processor could read at once, which, combined with pod health logic, led to excessive memory usage and processor failures. We resolved the issue by reducing the configuration value, which restored healthy processing across all partitions.

### Impact

All PlayFab titles using PlayStream actions for rules and segments were impacted. Action executions were delayed but not dropped; however, the prolonged delay meant that some actions may not have been useful by the time they were processed.

### Root Cause Analysis

The incident was caused by a misconfiguration in the number of records each processor attempted to read, combined with a change in the logic for partition allocation per processor. As processor pods failed due to memory exhaustion, the remaining healthy pods became overloaded, leading to a cascading failure and increasing delays in action processing.

### Action Items

To prevent similar incidents from happening again, we have taken the following actions:

* We reduced the maximum number of records each processor can read at once, improving processor reliability and preventing memory exhaustion.
* We improved our monitoring and alerting to detect abnormal processor delays and memory usage earlier.
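
The cascade described in the postmortem (pods exhausting memory, their partitions moving to the surviving pods, and those pods then failing too) can be illustrated with a toy model. This is only a sketch under stated assumptions and does not reflect PlayFab's actual implementation: the function `simulate_cascade`, the even spread of partitions across healthy pods, the "one full read buffered per partition" memory estimate, and every number used here are hypothetical.

```python
import math

# Hypothetical per-pod memory limit (256 MiB); not a PlayFab value.
POD_MEMORY_LIMIT_BYTES = 256 * 1024 * 1024


def simulate_cascade(num_pods, num_partitions, max_records_per_read,
                     bytes_per_record, pod_memory_limit_bytes,
                     initially_failed=0):
    """Return how many pods remain healthy once the toy system settles.

    Assumptions (illustrative only):
    - partitions are spread evenly across the healthy pods;
    - a pod buffers one full read (max_records_per_read) per partition;
    - a pod whose buffered bytes exceed its memory limit fails, and its
      partitions are redistributed to the remaining pods.
    """
    healthy = num_pods - initially_failed
    while healthy > 0:
        # The busiest pod holds the ceiling of the even split.
        partitions_per_pod = math.ceil(num_partitions / healthy)
        peak_memory = (partitions_per_pod * max_records_per_read
                       * bytes_per_record)
        if peak_memory <= pod_memory_limit_bytes:
            return healthy  # stable: every pod fits within its limit
        # The busiest pod runs out of memory; the rest absorb its partitions.
        healthy -= 1
    return 0  # every pod failed


# Large read size: stable while all 10 pods are up, collapses after 3 are lost.
print(simulate_cascade(10, 40, 5000, 10_000, POD_MEMORY_LIMIT_BYTES))
print(simulate_cascade(10, 40, 5000, 10_000, POD_MEMORY_LIMIT_BYTES,
                       initially_failed=3))
# Smaller read size: the same loss of 3 pods no longer triggers a cascade.
print(simulate_cascade(10, 40, 1000, 10_000, POD_MEMORY_LIMIT_BYTES,
                       initially_failed=3))
```

In this model, per-pod memory only grows as pods are lost, so once one pod exceeds its limit the remaining pods cannot recover without outside intervention. Lowering the per-read record cap restores headroom, which is consistent with the mitigation described above.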