AdGuard experienced a minor incident on July 11, 2025 affecting Website & services, AdGuard DNS, and one more component, lasting 1h 51m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jul 11, 2025, 11:52 AM UTC
We are currently investigating a network issue that is causing degraded availability of our websites, AdGuard VPN, and AdGuard DNS.
- identified Jul 11, 2025, 12:11 PM UTC
We've identified the issue and are currently implementing a fix.
- monitoring Jul 11, 2025, 01:09 PM UTC
The fix has been applied successfully. We're currently monitoring our services for any lingering impact.
- monitoring Jul 11, 2025, 01:09 PM UTC
We are continuing to monitor for any further issues.
- resolved Jul 11, 2025, 01:43 PM UTC
This incident has been resolved.
- postmortem Jul 11, 2025, 01:44 PM UTC
**Summary**

Our internal infrastructure experienced approximately one hour of degraded availability due to a failure at our network edge. One of our upstream provider's routers in the datacenter became unreachable, and as a result one of our two edge routers lost upstream connectivity. While this is a relatively common failure scenario, and one our architecture is explicitly designed to tolerate, our redundancy mechanism did not operate as expected.

**Architecture Overview**

Our edge routing stack is designed for high availability. It consists of two physical routers configured to act as a single logical gateway using a shared IP address. From the perspective of connected systems, this setup (often referred to as MLAG-style L3 redundancy or a "virtual router") appears as a single device. Inbound traffic is distributed across both routers based on hashing (ECMP or per-flow load balancing), and under normal circumstances either router can forward traffic upstream.

**What Went Wrong**

When one of the upstream links failed:

* The affected router remained active in the logical group and continued accepting traffic.
* The hash-based forwarding mechanism continued to assign flows to both routers, including the one that had no upstream connectivity.
* As a result, approximately half of the traffic was routed into a black hole: silently dropped by the router with no upstream path.

Externally, this manifested as intermittent availability. Services appeared "flaky" or unreachable in roughly 50% of cases, depending on which router a given packet was hashed to.

**Mitigation**

The immediate resolution was to manually remove the non-functional router from the logical group. This is a non-trivial operation, as simply powering down the router can have unintended side effects, and the process took longer than expected due to its operational complexity. Once the faulty node was excluded, all traffic was routed through the healthy router and services stabilized.
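To illustrate why roughly half of all traffic was lost, here is a minimal sketch of per-flow ECMP hashing across two routers. The router names, hash function, and flow tuples are illustrative assumptions, not our actual configuration: the point is only that a deterministic hash splits flows about 50/50, so one dead member black-holes about half of them.

```python
# Sketch of per-flow ECMP: a flow's 4-tuple is hashed to pick one of two
# edge routers. All names and values here are hypothetical.
import hashlib

ROUTERS = ["router-a", "router-b"]  # the two members of the logical group

def pick_router(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> str:
    """Deterministically map a flow's 4-tuple to one router (per-flow ECMP)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return ROUTERS[digest[0] % len(ROUTERS)]

# Suppose router-b loses its upstream but stays in the group: every flow
# hashed to it is silently dropped, i.e. roughly half of all flows.
flows = [("203.0.113.5", "198.51.100.7", port, 443) for port in range(20000, 21000)]
dropped = sum(1 for flow in flows if pick_router(*flow) == "router-b")
print(f"{dropped}/{len(flows)} flows black-holed")
```

Because the hash is per-flow, a given client tends to land on the same router repeatedly, which is why individual users saw services as either consistently broken or consistently fine rather than uniformly slow.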
**Next Steps**

We are currently investigating why automatic failover did not trigger as expected. Our routers are designed to detect upstream failures and withdraw from the logical group accordingly, but that mechanism failed silently. As a follow-up, we will:

* Reproduce the failure scenario in a controlled environment
* Validate and adjust failover and tracking logic
* Improve observability for edge failover behavior
* Develop faster manual intervention playbooks
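The withdrawal behavior described above, where a router that detects a dead upstream removes itself from the logical group, can be sketched as a simple state machine. The class name, probe model, and failure threshold below are hypothetical assumptions for illustration, not our actual tracking implementation.

```python
# Sketch of upstream tracking: after N consecutive failed upstream probes,
# a router withdraws from the logical group so ECMP stops sending it flows.
# Names and thresholds are illustrative assumptions.
from dataclasses import dataclass

FAIL_THRESHOLD = 3  # consecutive failed probes before withdrawing

@dataclass
class EdgeRouter:
    name: str
    in_group: bool = True
    consecutive_failures: int = 0

    def record_probe(self, upstream_reachable: bool) -> None:
        if upstream_reachable:
            self.consecutive_failures = 0
            self.in_group = True  # rejoin the group once upstream recovers
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAIL_THRESHOLD:
                self.in_group = False  # withdraw: stop attracting traffic

router = EdgeRouter("router-b")
for reachable in [True, False, False, False]:
    router.record_probe(reachable)
print(router.in_group)  # False: withdrawn after three failed probes
```

In the incident, the real-world equivalent of this check failed silently, so the unhealthy router never left the group; the follow-up items above are about reproducing and hardening exactly this path.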