Document Drafter incident

Service interruption

Minor · Resolved

Document Drafter experienced a minor incident on October 9, 2025 affecting the REST API, API Services - EU, and other components, lasting 6h 55m. The incident has been resolved; the full update timeline is below.

Started
Oct 09, 2025, 10:12 AM UTC
Resolved
Oct 09, 2025, 05:07 PM UTC
Duration
6h 55m
Detected by Pingoru
Oct 09, 2025, 10:12 AM UTC

Affected components

REST API, API Services - EU, API Services - EU 2, API Services - US East, API Services - Canada Central, API Services - Switzerland North, Document Drafter Portal

Update timeline

  1. investigating Oct 09, 2025, 08:26 AM UTC

    We are experiencing issues with Azure DNS resolution, which may cause partial outages in some portal instances. We are investigating and will provide updates as we learn more.

  2. investigating Oct 09, 2025, 09:20 AM UTC

    We are continuing to investigate the issue. There are large outages on Microsoft Azure's side. We will keep you updated as soon as we know more, and we apologize for the interruption. To check Azure's health status: https://statusgator.com/services/azure

  3. investigating Oct 09, 2025, 10:11 AM UTC

    We are continuing to investigate this issue.

  4. identified Oct 09, 2025, 10:12 AM UTC

    Microsoft has now reported an outage and global issue; you can follow updates on Azure's status page: https://azure.status.microsoft/en-us/status

    Impact Statement: Starting at 07:40 UTC on 09 October 2025, Azure customers using Azure Front Door (AFD) may experience intermittent issues when accessing their services; this includes the ability to access the Azure Portal.

    Current Status: Our monitoring detected a significant capacity loss of about 30% of Azure Front Door instances, predominantly across Europe and Africa. At this stage we have ruled out any deployments which could have triggered this event. The next update will be provided within 60 minutes, or as events warrant. This message was last updated at 10:14 UTC on 09 October 2025.

    We will keep updating you.

  5. identified Oct 09, 2025, 10:18 AM UTC

    We are continuing to work on a fix for this issue.

  6. identified Oct 09, 2025, 11:03 AM UTC

    We are continuing to work on a fix for this issue.

  7. identified Oct 09, 2025, 11:04 AM UTC

    We are continuing to work on a fix for this issue. We have received this update from Microsoft:

    Current Status: Our monitoring detected a significant capacity loss of about 30% of Azure Front Door instances, predominantly across Europe, the Middle East, and Africa. We understand that this is due to a dependency on some underlying Kubernetes instances that crashed. We have ruled out any deployments that could have triggered this event. We have been restarting these underlying Kubernetes instances, and AFD instances are coming back online. Customers should start seeing recovery as we bring these instances back online, and we expect full mitigation within the next 90 minutes. The next update will be provided within 60 minutes. This message was last updated at 11:01 UTC on 09 October 2025.

  8. identified Oct 09, 2025, 11:40 AM UTC

    We are continuing to work on a fix for this issue and have a new update from Microsoft:

    Impact Statement: Starting at 07:40 UTC on 09 October 2025, Azure customers using Azure Front Door (AFD) may experience intermittent delays or timeouts when accessing their services. This includes the ability to access the Azure Portal and the Entra Admin Portal.

    Current Status: As we continue our mitigation efforts, we have successfully recovered 96% of impacted resources; teams are working on recovering the remaining 4% of impacted customers. The next update will be provided within 60 minutes. This message was last updated at 11:36 UTC on 09 October 2025.

  9. identified Oct 09, 2025, 02:26 PM UTC

    We have another update from Microsoft; the issue should be largely resolved.

    Impact Statement: Starting at 07:40 UTC on 09 October 2025, Azure customers using Azure Front Door (AFD) from Europe, the Middle East, and Africa may experience intermittent delays or timeouts when accessing their services.

    Current Status: Azure customers using Azure Front Door (AFD) should see consistent availability, with slightly higher latencies decreasing as we continue to recover additional resources. The next update will be provided within 60 minutes. This message was last updated at 14:05 UTC on 09 October 2025.

  10. identified Oct 09, 2025, 04:09 PM UTC

    We are continuing to work on a fix for this issue and have a new update from Microsoft:

    Impact Statement: Starting at 07:40 UTC on 09 October 2025, Azure customers using Azure Front Door (AFD) from Europe, the Middle East, and Africa may experience intermittent delays or timeouts when accessing their services.

    Current Status: Azure customers using Azure Front Door (AFD) should see consistent availability, with slightly higher latencies decreasing as we continue to recover additional resources. The next update will be provided within 60 minutes. This message was last updated at 16:03 UTC on 09 October 2025.

  11. identified Oct 09, 2025, 05:06 PM UTC

    Microsoft has announced that the issue is resolved. We will share more details once we obtain a post-mortem.

  12. resolved Oct 09, 2025, 05:07 PM UTC

    This incident has been resolved.

  13. postmortem Oct 30, 2025, 07:25 AM UTC

    We have received the following preliminary post mortem from Microsoft.

    COMMUNICATION: _Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident (to hear from our engineering leaders, and to get any questions answered by our experts) or watch a recording of the livestream (available later, on YouTube):_ [https://aka.ms/AIR/QNBQ-5W8](https://aka.ms/AIR/QNBQ-5W8)

    **What happened?**

    Between 07:50 UTC and 16:00 UTC on 09 October 2025, Microsoft services and Azure customers leveraging Azure Front Door (AFD) and Azure Content Delivery Network (CDN) may have experienced increased latency and/or timeouts – primarily across Africa and Europe, as well as Asia Pacific and the Middle East. This impacted the availability of the Azure Portal as well as other management portals across Microsoft. Peak failure rates for AFD reached approximately 17% in Africa, 6% in Europe, and 2.7% in Asia Pacific and the Middle East. Availability was restored by 12:50 UTC, though some customers continued to experience elevated latency. Latency returned to baseline levels by 16:00 UTC, at which point the incident was mitigated.

    **What do we know so far?**

    AFD routes traffic using globally distributed edge sites and supports Microsoft services including the management portals. The AFD control plane generates system metadata that the data plane consumes for customer-initiated 'create', 'update', or 'delete' operations on AFD or CDN profiles. One of the trigger conditions for this incident was a software defect in the latest version of the AFD control plane, which had been rolled out six weeks prior to the incident in line with our safe deployment practices. Newly created customer tenant profiles were being onboarded to the newer control plane version.
    Our service monitoring detected elevated data plane crashes due to a previously unknown bug, triggered by erroneous metadata generated by a particular sequence of profile update operations. Our automated protection layer intercepted this in early update stages and prevented this metadata from propagating any further to the data plane, thereby averting any customer impact at that time. In addition, as the newer control plane was running in tandem with the previous version of the control plane, we disabled the new control plane from taking any requests.

    On 09 October 2025, we initiated a cleanup of the affected tenant configuration with the erroneous metadata. Since the automated protection system was blocking the impacted customer tenant profile updates in the initial stage, we temporarily bypassed it to allow the cleanup of the tenant configuration to proceed. By bypassing the protection system, the erroneous metadata was inadvertently able to propagate to later stages – and triggered the bug in the data plane that crashed the data plane service.

    This resulted in a disruption to a significant number of edge sites across Europe and Africa; approximately 26% of AFD data plane infrastructure resources in these regions were impacted. As part of AFD mechanisms to manage traffic, load was automatically distributed to nearby edge sites (including in Asia Pacific and the Middle East). Additionally, as regional business-hours traffic started ramping up, it added to the overall traffic load. The increased volume of traffic on the remaining healthy edge sites resulted in high resource utilization, which exceeded operational thresholds. This triggered an additional layer of protection which started distributing traffic to a broader set of edge sites globally, to reduce further impact. Recovery required a combination of automated restarts, manual intervention where automated restarts were taking too long, and traffic failover operations for impacted management portals.
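The failure mode described above (a validation gate between control plane and data plane that blocks bad metadata, defeated by a manual bypass) can be sketched in miniature. All names, checks, and data shapes below are hypothetical illustrations, not Azure internals:

```python
# Hypothetical sketch of a configuration "protection layer": metadata
# produced by a control plane must pass validation before it may
# propagate to the data plane. Bypassing the gate lets erroneous
# metadata through -- the trigger in the incident described above.

def validate_metadata(metadata: dict) -> bool:
    """Reject metadata that would crash downstream consumers."""
    required = {"tenant_id", "routes"}
    if not required.issubset(metadata):
        return False
    # Illustrative rule: every route must name a non-empty origin.
    return all(r.get("origin") for r in metadata["routes"])

def propagate(metadata: dict, bypass_protection: bool = False) -> bool:
    """Push metadata toward the data plane; True means it propagated."""
    if not bypass_protection and not validate_metadata(metadata):
        return False  # blocked by the protection layer
    # With bypass_protection=True, invalid metadata reaches the data
    # plane unchecked, as happened during the cleanup operation.
    return True
```

The sketch only illustrates why the hardened procedure mentioned later (never bypassing the protection system) matters: the gate is the sole point where bad metadata is caught.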
    Full mitigation was achieved once edge site infrastructure resources stabilized and latency returned to normal. Additionally, initial customer notifications were delayed, primarily due to challenges determining impact while attempting to target communications to those impacted. We have automated communications to notify customers of incidents quickly; unfortunately, this capability was not yet supported in this incident scenario.

    **How did we respond?**

    * 07:30 UTC on 09 October 2025 – The cleanup operation was initiated.
    * 07:50 UTC on 09 October 2025 – Initial customer impact began, and increased over the next 90 minutes.
    * 08:13 UTC on 09 October 2025 – Our telemetry detected resource availability loss across multiple AFD edge sites. We began investigating as impact continued to grow.
    * 09:04 UTC on 09 October 2025 – We identified that the crashes were due to the previously identified data plane bug.
    * 09:08 UTC on 09 October 2025 – Automated restarts began for our AFD infrastructure resources, and manual intervention began for resources that did not recover automatically.
    * 09:15 UTC on 09 October 2025 – Customer impact reached its peak.
    * 10:01 UTC on 09 October 2025 – Communications were published to the Azure Status page.
    * 10:45 UTC on 09 October 2025 – Targeted customer communications were sent to Azure Service Health.
    * 11:59 UTC on 09 October 2025 – Management portals, like the Azure Portal, performed failover operations (including using scripts to update the load balancing configuration, to split traffic between multiple routes), helping restore service availability.
    * 12:50 UTC on 09 October 2025 – Availability for AFD fully recovered; however, a subset of customers may still have been experiencing elevated latency.
    * 16:00 UTC on 09 October 2025 – After continuous monitoring of latency improvement, we declared the incident mitigated after confirming recovery.
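The portal failover step in the response timeline mentions updating a load balancing configuration to split traffic between multiple routes. A weighted route selection of that kind can be sketched as follows; the route names and weights are illustrative only, not the actual configuration used:

```python
# Hypothetical sketch of weight-based traffic splitting between routes,
# e.g. keeping some traffic on a CDN/front-door route while shifting
# the rest directly to origins during a failover.
import random

# (route name, weight) pairs; weights need not sum to any fixed total.
ROUTES = [("afd-primary", 50), ("direct-origin", 50)]

def pick_route(routes=ROUTES, rng=random.random):
    """Pick a route with probability proportional to its weight."""
    total = sum(weight for _, weight in routes)
    point = rng() * total
    for name, weight in routes:
        point -= weight
        if point < 0:
            return name
    return routes[-1][0]  # guard against floating-point edge cases
```

Shifting the split is then just an edit to the weights (e.g. `[("afd-primary", 0), ("direct-origin", 100)]` for a full failover), which is what makes scripted configuration updates a fast mitigation lever.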
    **How are we making incidents like this less likely or less impactful?**

    * We have hardened our standard operating procedures to ensure that the configuration protection system is not bypassed for any operation. (Completed)
    * We have fixed the control plane defect which generated the erroneous tenant metadata that led to the data plane resource crashes. (Completed)
    * We have fixed the bug in the data plane. (Completed)
    * We will expand the automated customer alerts sent via Azure Service Health to include similar classes of service degradation. (Estimated completion: November 2025)
    * We are making improvements to our Azure Portal failover systems from AFD, to be more robust and automated. (Estimated completion: December 2025)
    * We are building additional runtime configuration validation pipelines against a replica of the real-time data plane, as a pre-validation step prior to applying changes broadly. (Estimated completion: March 2026)
    * We are improving data plane resource instance recovery time following any impact to the data plane. (Estimated completion: March 2026)

    **How can customers make incidents like this less impactful?**

    * Consider implementing failover strategies with Azure Traffic Manager, to fail over from Azure Front Door to your origins: [https://learn.microsoft.com/azure/architecture/guide/networking/global-web-applications/overview](https://learn.microsoft.com/azure/architecture/guide/networking/global-web-applications/overview)
    * Consider reviewing our best practices for Azure Front Door architecture: [https://learn.microsoft.com/azure/well-architected/service-guides/azure-front-door](https://learn.microsoft.com/azure/well-architected/service-guides/azure-front-door)
    * Consider implementing retry patterns with exponential backoff, to improve workload resiliency: [https://learn.microsoft.com/azure/architecture/patterns/retry](https://learn.microsoft.com/azure/architecture/patterns/retry)
    * More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: [https://aka.ms/AzPIR/WAF](https://aka.ms/AzPIR/WAF)
    * The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: [https://aka.ms/AzPIR/Monitoring](https://aka.ms/AzPIR/Monitoring)
    * Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: [https://aka.ms/AzPIR/Alerts](https://aka.ms/AzPIR/Alerts)

    **How can we make our incident communications more useful?**

    You can rate this PIR and provide any feedback using our quick 3-question survey: [http://aka.ms/AzPIR/QNBQ-5W8](http://aka.ms/AzPIR/QNBQ-5W8)
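The retry-with-exponential-backoff pattern recommended in Microsoft's customer guidance can be sketched as follows; the attempt limits and delays are illustrative, and jitter is added so that many clients do not retry in lockstep:

```python
# Hedged sketch of retry with exponential backoff and jitter, the
# resiliency pattern referenced above. Limits/delays are illustrative.
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Invoke `call()`; on failure, wait base_delay * 2**attempt
    (capped at max_delay, scaled by random jitter) and retry.
    Re-raises the last error once max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # add jitter
```

During an intermittent outage like this one, such a wrapper lets transient AFD timeouts heal on their own instead of surfacing every failed request; only operations that are safe to repeat (idempotent) should be retried this way.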