Chargebee incident

AWS Outage Impacting Chargebee Services

Major · Resolved

Chargebee experienced a major incident on October 20, 2025, affecting Admin Console (US), API (US), and eight more components, lasting 16h 50m. The incident has been resolved; the full update timeline is below.

Started
Oct 20, 2025, 09:42 AM UTC
Resolved
Oct 21, 2025, 02:33 AM UTC
Duration
16h 50m
Detected by Pingoru
Oct 20, 2025, 09:42 AM UTC

Affected components

Admin Console (US), API (US), Checkout (US), Webhooks (US), Dashboard (US), API (EU), Checkout (EU), Webhooks (EU), Admin Console (EU), Dashboard (EU)

Update timeline

  1. identified Oct 20, 2025, 09:42 AM UTC

    We are currently experiencing a service interruption due to an ongoing AWS outage. Our team is working with AWS to restore full access as soon as possible. For more details, see https://health.aws.amazon.com/health/status. We sincerely appreciate your patience and understanding during this time.

  2. identified Oct 20, 2025, 09:47 AM UTC

    We have an update from AWS, and we will continue to monitor: "We are seeing significant signs of recovery. Most requests should now be succeeding. We continue to work through a backlog of queued requests. We will continue to provide additional information."

  3. identified Oct 20, 2025, 10:09 AM UTC

    Latest note from AWS: "We continue to observe recovery across most of the affected AWS Services. We can confirm global services and features that rely on US-EAST-1 have also recovered. We continue to work towards full resolution and will provide updates as we have more information to share"

  4. monitoring Oct 20, 2025, 11:07 AM UTC

    Between 07:00 AM and 09:27 AM UTC, we experienced increased error rates and latency across multiple services due to an AWS outage specifically related to a DNS resolution issue impacting DynamoDB. Our systems have now recovered, and services are operating normally. We will continue to monitor and provide further updates as needed.

  5. monitoring Oct 20, 2025, 02:07 PM UTC

    AWS has identified and mitigated the issue affecting multiple dependent services. Most AWS operations have recovered, but some residual latency may persist as backlogs clear. We continue to monitor our systems closely for any impact.

  6. monitoring Oct 20, 2025, 03:52 PM UTC

    AWS is still addressing some issues with dependent services. We continue to monitor our systems closely for any impact.

  7. resolved Oct 21, 2025, 02:33 AM UTC

    AWS has implemented the necessary fix and confirmed that all affected services have been fully restored. We have monitored our systems and they are now operating normally. We appreciate your patience while we worked with our provider to restore full functionality. AWS will share a detailed post-incident summary, and we are marking this incident as resolved.

  8. postmortem Oct 30, 2025, 07:43 AM UTC

    # **Incident Overview:**

    **On October 20, 2025**, a major outage in the [**AWS US-EAST-1**](https://aws.amazon.com/message/101925/) region impacted multiple global services, including components critical to our infrastructure. Between **06:46 AM UTC and 09:30 AM UTC**, we experienced major failures in our systems, primarily due to **DNS resolution** issues and **EC2 instance scaling** impairments within AWS. This resulted in degraded performance and limited availability of our services during that window.

    # **Root Cause and Timelines:**

    **06:46 AM – 09:30 AM UTC, 20th October (Major Impact Window)**

    * We began observing increased error rates and DNS lookup failures from **06:48 AM UTC**. Critical AWS services such as DynamoDB, SQS, STS, EC2, and Lambda were fully degraded.
    * The AWS outage, rooted in a DNS resolution failure affecting EC2 services, prevented us from provisioning new instances and caused some of our application servers to become unhealthy. Since we could not provision new instances, we shed load on existing instances by moving low-priority workloads to a separate instance to preserve uptime.
    * By **09:30 AM UTC**, core services were partially stabilized, though intermittent DNS errors persisted.

    # **Recovery:**

    **After 09:30 AM UTC, 20th October**

    * Some application servers became unhealthy, and we restarted them manually because **autoscaling/new instance provisioning** was not available.
    * We continued to observe intermittent **DNS lookup failures** from AWS, averaging about **2k errors** per hour.

    **Job Rescheduling and Completion:**

    * To avoid further load on the system during the AWS recovery phase, we paused rescheduling of failed jobs until AWS declared its services fully operational, especially autoscaling.
    * AWS gradually restored autoscaling and EC2 instance provisioning by **09:30 PM UTC** on **20th October**.
    * By **11:00 PM UTC** on 20th October, system stability had largely been restored, though some upstream third-party services had not yet fully **recovered**.
    * All **critical jobs** were rescheduled in a staggered way, so as not to overload the system, and completed by **02:45 PM UTC** on 21st October; a sketch of this staggered approach follows this postmortem.
    * All non-critical jobs were rescheduled, and the entire activity was completed by **02:15 PM UTC** on **22nd October**.

    We have treated this downtime as a key learning opportunity and established a dedicated internal team to enhance our tooling and processes for faster recovery.
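
For illustration, here is a minimal sketch of the kind of staggered, priority-ordered rescheduling the postmortem describes: replay critical jobs first, in small batches, pausing between batches so the recovering system never absorbs the full backlog at once. The job model, batch size, pause values, and `resubmit` function are all hypothetical stand-ins; Chargebee has not published its actual scheduler or queue implementation.

```python
import time
from dataclasses import dataclass


@dataclass
class FailedJob:
    job_id: str
    critical: bool  # critical jobs are replayed before non-critical ones


def resubmit(job: FailedJob) -> None:
    # Placeholder: a real system would re-enqueue the job on its
    # message queue or job scheduler here.
    print(f"resubmitted {job.job_id}")


def reschedule_staggered(jobs: list[FailedJob],
                         batch_size: int = 50,
                         pause_seconds: float = 30.0) -> None:
    """Replay failed jobs in small batches, critical jobs first,
    pausing between batches to limit load on a recovering system."""
    # sorted() is stable; key False (critical) sorts before True.
    ordered = sorted(jobs, key=lambda j: not j.critical)
    for start in range(0, len(ordered), batch_size):
        for job in ordered[start:start + batch_size]:
            resubmit(job)
        if start + batch_size < len(ordered):
            time.sleep(pause_seconds)  # stagger batches


if __name__ == "__main__":
    backlog = [FailedJob(f"job-{i}", critical=(i % 10 == 0))
               for i in range(120)]
    reschedule_staggered(backlog, batch_size=40, pause_seconds=1.0)
```

In practice, the pause and batch size would be tuned against the provider's recovery signals and observed error rates rather than fixed delays.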