Applaud HR incident

PROD EU is down

Applaud HR experienced a critical incident on February 6, 2024 affecting Production EU, lasting 21m. The incident has been resolved; the full update timeline is below.

Started: Feb 06, 2024, 02:09 PM UTC
Resolved: Feb 06, 2024, 02:31 PM UTC
Duration: 21m
Detected by Pingoru: Feb 06, 2024, 02:09 PM UTC

Affected components

Production EU

Update timeline

investigating Feb 06, 2024, 02:09 PM UTC

We are currently investigating this issue.
monitoring Feb 06, 2024, 02:26 PM UTC

We identified the issue and resolved it. We have been monitoring the instance for some time and have updated the status.
resolved Feb 06, 2024, 02:31 PM UTC

We are done with all the checks, and we don't see any more issues with the instance. We understand the importance of your business processes and apologize for any inconvenience caused by this incident. Thank you.
postmortem Feb 14, 2024, 09:12 AM UTC

## Primary Symptom

* Complete unavailability of all services in the EU region, resulting in downtime for all tenants and users accessing the system.

## Initial Investigation & Findings

* Analysis revealed that when the old server in the Auto Scaling Group (ASG) was terminated, the Elastic IP associated with one of the proxy servers used by Route 53 (DNS) was released, which disrupted services.
* After services were restored to the Production EU environment, a few customers encountered accessibility problems, specifically with custom sign-in pages.

## Underlying Issues

* Further analysis revealed that the accessibility issues reported by customers originated from cross-zone load balancing being inadvertently disabled during updates to the Network Load Balancer IPs. This led to uneven traffic distribution across Availability Zones (AZs), causing instability.
* Cross-zone load balancing plays a crucial role in the stability and resilience of our infrastructure by distributing incoming traffic evenly across healthy targets in multiple AZs. With it disabled, each load balancer node directed traffic only to healthy targets within its own AZ, which left the instance unstable and caused the accessibility issues reported by clients.

## Resolution

* Added the old IP back to DNS and temporarily attached it to one of the running proxy servers to fix the accessibility issues.
* Enabled cross-zone load balancing on the Network Load Balancer for proper traffic distribution across AZs.
* During the next scheduled Production downtime, we will remove the outdated IP address associated with the proxy servers from the DNS configuration.

## Preventive Measures

* We are proactively implementing measures to mitigate automatic adjustments and enhance our configuration review protocols to ensure sustained system stability.
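
The effect of cross-zone load balancing described under Underlying Issues can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not a description of the actual Production EU setup: the AZ names, the target counts, and the `traffic_share` helper are hypothetical, and client traffic is assumed to be split evenly across one load balancer node per AZ.

```python
def traffic_share(targets_per_az, cross_zone):
    """Return the fraction of total traffic each healthy target receives.

    targets_per_az maps a (hypothetical) AZ name to its number of healthy
    targets; every AZ is assumed to have at least one. Client traffic is
    assumed to be split evenly across one load balancer node per AZ.
    """
    node_share = 1.0 / len(targets_per_az)
    total_targets = sum(targets_per_az.values())
    shares = {}
    for az, count in targets_per_az.items():
        for i in range(count):
            if cross_zone:
                # Every node can reach every healthy target in any AZ.
                shares[f"{az}-t{i}"] = 1.0 / total_targets
            else:
                # Each node only reaches the healthy targets in its own AZ.
                shares[f"{az}-t{i}"] = node_share / count
    return shares


if __name__ == "__main__":
    layout = {"az-a": 1, "az-b": 3}  # hypothetical, uneven target counts
    for enabled in (True, False):
        print("cross-zone enabled:", enabled)
        for target, share in traffic_share(layout, enabled).items():
            print(f"  {target}: {share:.3f}")
```

With cross-zone load balancing enabled, all four targets in this example receive a quarter of the traffic each. With it disabled, the single target in `az-a` receives half of all traffic while each target in `az-b` receives one sixth, which is the kind of uneven distribution the postmortem describes.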