Xander experienced an incident on June 20, 2025, lasting 3d 22h. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jun 20, 2025, 03:28 PM UTC
We are currently investigating the following issue:
Component(s): Ad Serving
Impact(s): Line item pacing issues for objects near end of budget
Not impacted: UI, API
Geolocation(s): Global
Status: We will provide an update as soon as more information is available. Thank you for your patience.
- resolved Jun 24, 2025, 01:32 PM UTC
The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.
- postmortem Sep 10, 2025, 08:01 PM UTC
Line items near end of their budget halting

Incident summary
From approximately 23:29 UTC on Thursday, June 12 to 18:29 UTC on Tuesday, June 24, 2025, an ad-serving issue impacted all regions, services, and subscriptions.

Incident impact
Nature of impact(s): A line item pacing issue was detected for objects approaching the end of their budget. Some data was incomplete or inaccurate until it was reprocessed, requiring the affected data to be pulled again as needed. Data was unavailable through the affected service(s).
Incident duration: ~283 hours (23:29 UTC on Thursday, June 12 to 18:29 UTC on Tuesday, June 24, 2025).
Scope: Global
Components: Ad Serving

Timeframe (UTC)
2025-06-12 23:29: Incident started.
2025-06-18 16:06: Issue detected.
2025-06-19 13:39: Escalated to engineering.
2025-06-24 16:55: Mitigated.
2025-06-24 18:29: Recovered.

Root cause
The incident was caused by stale minute-level data being ingested into the budget database table between June 10 and June 12, 2025. This inaccurate data artificially inflated lifetime spend calculations, so the system incorrectly determined that certain line items had exceeded their lifetime budgets. Those line items were marked "out of budget," halting delivery for some line items globally and impacting both Microsoft Monetize and Invest clients. The incident was detected on June 18, 2025, through a combination of automated data validation and manual investigation prompted by multiple support cases reporting stalled line items. The result was temporary under-delivery for affected line items and disrupted campaign pacing during the impact window.
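To make the failure mode concrete, here is a minimal sketch (all identifiers, schema, and numbers below are invented for illustration, not Xander's actual implementation): when lifetime spend is aggregated from minute-level rows, a stale or duplicated row inflates the total and can trip the lifetime-budget check even though the line item still has budget remaining.

```python
from dataclasses import dataclass

@dataclass
class MinuteSpend:
    """One minute-level spend record (hypothetical schema)."""
    line_item_id: int
    minute_utc: str   # e.g. "2025-06-12T23:29"
    spend: float

def lifetime_spend(rows, line_item_id):
    """Aggregate minute-level spend rows for one line item."""
    return sum(r.spend for r in rows if r.line_item_id == line_item_id)

def is_out_of_budget(rows, line_item_id, lifetime_budget):
    """Delivery halts once aggregated spend reaches the lifetime budget."""
    return lifetime_spend(rows, line_item_id) >= lifetime_budget

# Correct data: line item 42 has spent 90 of a 100 budget -> still delivering.
good = [MinuteSpend(42, "2025-06-12T23:28", 45.0),
        MinuteSpend(42, "2025-06-12T23:29", 45.0)]
assert not is_out_of_budget(good, 42, 100.0)

# Stale ingestion re-delivers the same minute, double-counting its spend,
# so the line item is falsely marked "out of budget" and delivery halts.
stale = good + [MinuteSpend(42, "2025-06-12T23:29", 45.0)]
assert is_out_of_budget(stale, 42, 100.0)
```

Line items near the end of their budget are the most exposed here: the closer real spend is to the lifetime budget, the smaller the stale increment needed to push the calculation over the threshold.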
Resolution
The issue was mitigated through a coordinated series of targeted remediation actions:
- Reprocessing jobs for the affected hours corrected the stale minute-level data in the budget database, ensuring accurate recalculation of line item budgets and pacing.
- The support team recommended jumpstarting campaign delivery through bulk edits or duplications, enabling campaigns to resume without further disruption.
- Queries were developed and executed to identify line items stuck in rollover for detailed analysis, and collaboration with the Engineering team verified and corrected data sources related to rollover events.
- Coordination with the reporting team addressed potential anomalies in analytics and reporting, restoring confidence in budget pacing and delivery metrics.
- Monitoring between database tables was introduced, and sync write timeouts were increased to reduce the risk of rebalancing issues in distributed consumer groups.
These combined efforts restored full functionality for affected line items, stabilized budget pacing, and enabled normal campaign delivery to resume, with all affected services fully restored in subsequent runs.
Immediate measures taken to mitigate the incident
To stabilize the platform and mitigate further impact, the engineering team implemented a series of targeted, prompt actions:
Partition and consumer optimizations: Partition pausing metrics were added, and consumer session timeouts, heartbeat intervals, and sync timeouts were increased to reduce lag and improve synchronization across distributed event streaming platforms. Pod sizes were also increased to enhance throughput and reduce total lag.
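The consumer-timeout tuning described above can be sketched with Kafka-style configuration keys (the key names follow Kafka's standard consumer configuration; the values are purely illustrative assumptions, not Xander's production settings):

```python
# Hypothetical consumer settings for a distributed event streaming platform.
# Key names are Kafka's consumer configuration keys; values are invented.
consumer_config = {
    "session.timeout.ms": 60_000,     # raised to tolerate slower poll cycles
    "heartbeat.interval.ms": 20_000,  # kept well under the session timeout
    "max.poll.interval.ms": 600_000,  # allow longer processing between polls
}

# A common rule of thumb: the heartbeat interval should be at most one third
# of the session timeout, so a consumer can miss a couple of heartbeats
# before the group coordinator evicts it and triggers a rebalance.
assert consumer_config["heartbeat.interval.ms"] * 3 <= consumer_config["session.timeout.ms"]
```

Raising these windows trades slower failure detection for fewer spurious rebalances, which matches the stated goal of reducing lag and rebalancing risk in the consumer groups.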
Enhanced monitoring and alerting: Existing alerts were reviewed and refined, and additional alerts with improved threshold parameters were introduced to proactively detect failed database writes, warehouse latencies, and minute-level data discrepancies, enabling faster detection and remediation of anomalies in data pipelines.
Minute-level data validation: Detection mechanisms for minute-level data in the database were implemented, providing early warning of stale or inconsistent records to prevent future impact.
System reliability improvements: Together, these actions strengthened data consistency, reduced the likelihood of delivery disruption, and increased confidence in the platform's operational stability. This work will also drive enhanced engineering procedures and more rigorous release protocols, aimed at ensuring greater stability and preventing similar incidents in the future.
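As a minimal sketch of what such a minute-level staleness check could look like (the schema, threshold, and function are hypothetical assumptions, not the actual detection mechanism): flag any minute bucket whose data was written long after the minute it describes, since late re-ingestion of old minutes is exactly the pattern that corrupted the budget table.

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold: how far behind its minute bucket a row may be
# written before it is treated as stale (value is an assumption).
MAX_INGEST_LAG = timedelta(minutes=10)

def find_stale_minutes(rows):
    """Return minute buckets whose rows arrived later than MAX_INGEST_LAG.

    `rows` is a list of (minute_bucket, ingested_at) datetime pairs,
    standing in for a query against the budget database table.
    """
    return [minute for minute, ingested_at in rows
            if ingested_at - minute > MAX_INGEST_LAG]

base = datetime(2025, 6, 12, 23, 29, tzinfo=timezone.utc)
rows = [
    (base, base + timedelta(minutes=1)),  # fresh: written one minute later
    (base, base + timedelta(hours=6)),    # stale: re-ingested six hours later
]
assert find_stale_minutes(rows) == [base]
```

Wiring a check like this to an alert gives the early warning described above: stale records are surfaced before they feed into lifetime spend totals, rather than being discovered days later through stalled line items.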