Piano.io incident

Some Piano services in the US environment were affected by an AWS incident

Major · Resolved

Piano.io experienced a major incident on October 20, 2025, affecting VX Checkout - US - buy.piano.io, API Endpoints - US - api.piano.io, and four more components, lasting 5h 42m. The incident has been resolved; the full update timeline is below.

Started
Oct 20, 2025, 08:09 AM UTC
Resolved
Oct 20, 2025, 01:52 PM UTC
Duration
5h 42m
Detected by Pingoru
Oct 20, 2025, 08:09 AM UTC

Affected components

* VX Checkout - US - buy.piano.io
* API Endpoints - US - api.piano.io
* Composer Experience Execution - US
* US - id.piano.io
* Piano ID and VX transactional Emails
* API Endpoints - EU - api-eu.piano.io

Update timeline

  1. investigating Oct 20, 2025, 08:09 AM UTC

    We are currently investigating this issue.

  2. investigating Oct 20, 2025, 08:38 AM UTC

    We are continuing to investigate this issue.

  3. investigating Oct 20, 2025, 08:41 AM UTC

    We are continuing to investigate this issue.

  4. investigating Oct 20, 2025, 09:53 AM UTC

    We are continuing to investigate this issue.

  5. monitoring Oct 20, 2025, 09:56 AM UTC

    A fix has been implemented and we are monitoring the results.

  6. investigating Oct 20, 2025, 10:40 AM UTC

    We are currently investigating this issue.

  7. monitoring Oct 20, 2025, 11:17 AM UTC

    A fix has been implemented and we are monitoring the results.

  8. resolved Oct 20, 2025, 01:52 PM UTC

    This incident has been resolved.

  9. postmortem Oct 23, 2025, 10:31 AM UTC

# AWS US-EAST-1 Outage — Public RCA

**Date:** 2025-10-20
**Status:** Resolved

## Summary

On 2025-10-20, an outage in the AWS US-EAST-1 region caused interruptions to customer purchases, email delivery, reporting, and our sandbox environment. Core customer-facing services were restored by 09:20 UTC; sandbox services were fully restored by 17:30 UTC. We determined the outage was caused by multiple simultaneous infrastructure failures in the region that prevented our systems from processing requests and scaling to meet load.

## Impact and Timeline (UTC)

* **06:50 - 09:20** Purchase transactions failed
* **07:25 - 09:20** Email delivery was delayed
* **07:35 - 09:25** Composer experienced degraded performance, particularly in server-side experiences and metering, which may have caused delays in applying updates
* **06:50 - 11:05** Reporting that depends on our US-based API endpoints was unavailable across regions
* **12:30 - 17:30** Sandbox environment unavailable due to lack of capacity

## Root Cause

A combination of AWS service disruptions in US-EAST-1 prevented our systems from operating and scaling normally:

* **DynamoDB service errors** blocked payment processing that relies on DynamoDB for scalable transaction state
* **SQS queue failures** prevented queue processing used for email delivery and reporting workflows
* **EC2 API disruptions** and widespread Spot instance reclamation prevented us from provisioning new instances. Existing capacity was constrained; in several cases it was unsafe to force-restart instances
* **S3 failures** prevented timely application of configuration updates required to restore some services

## Actions Taken During the Incident

* Immediately mobilized all OPS and on-call engineers to triage and mitigate
* Engaged directly with our AWS account team and exchanged frequent status updates with AWS support
* Tuned Karpenter to expand instance allocation across additional availability zones within the region to restore capacity where possible (an illustrative sketch of this kind of change follows the RCA)
* Monitored system behavior continuously and evaluated the need for a cross-region disaster recovery (DR) activation; determined DR would carry risk of partial data loss and was not necessary once regional capacity was restored

## Resolution

As AWS services gradually recovered, we restored capacity and brought our services back online. Core customer-facing services (purchases, emails, reporting) were restored by 09:20 UTC; the sandbox environment returned to normal by 17:30 UTC after additional capacity and configuration propagation completed.

## Next Steps and Mitigations

We are implementing the following actions to reduce the risk and impact of similar incidents:

* Conduct a joint post-incident review with AWS to identify specific failure modes and remediation steps on their side
* Add additional resilience for DynamoDB-backed flows (e.g., improved fallback paths; see the sketch after this RCA)
* Reduce reliance on Spot capacity for sandbox services
* Improve S3 configuration propagation and rollback capabilities to avoid blocking recoveries
* Consider simplifying critical email delivery in case queues are not available (the usual cross-regional queue rerouting would not work in this case; see the sketch after this RCA)

## Q&A

**Q: Why didn't we have any failover mechanisms or backup plans in case anything like this happened?**

A: Mitigations (rerouting services) were blocked by failures in the AWS-based dynamic reconfiguration subsystem. Full DR was considered disproportionate: its potential consequences (extended full downtime, potential data loss) were too serious.
**Q: Do we plan on improving for the future to better handle situations like these in terms of triggers or warnings beforehand?**

A: We had monitors alerting almost immediately, so on-call engineers started looking into the issue right away. For future plans, see the mitigations section above.

**Q: Why did it take so long for Sandbox to be live again?**

A: AWS did not have on-demand server instances available, and Spot instances were withdrawn. We could not reuse production cluster nodes for sandbox because the production cluster itself was considerably over-utilized due to the lack of available servers.

---

We recognize the impact this incident had on customers and are committed to improving the resilience and availability of our platform. We will follow up with the findings from the joint post-incident review with AWS and a timeline for the planned mitigations.
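
To make the Karpenter action above a little more concrete: in Karpenter, the availability zones a node pool may provision into are expressed as a `topology.kubernetes.io/zone` requirement under `spec.template.spec.requirements` of a NodePool. The snippet below only illustrates the shape of such a requirement; the zone list and the use of Python to build it are assumptions for illustration, not Piano's actual configuration.

```python
import json

# Illustrative only: the shape of a Karpenter NodePool requirement that
# widens the set of availability zones the autoscaler may draw from.
# In Karpenter this entry lives under spec.template.spec.requirements;
# the zone list here is hypothetical.
widened_zone_requirement = {
    "key": "topology.kubernetes.io/zone",
    "operator": "In",
    "values": [
        "us-east-1a",
        "us-east-1b",
        "us-east-1c",
        "us-east-1d",
        "us-east-1f",
    ],
}

if __name__ == "__main__":
    # In practice this would be applied to the NodePool manifest through
    # the usual deployment tooling; printing it is enough for illustration.
    print(json.dumps(widened_zone_requirement, indent=2))
```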
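
One of the planned mitigations is "improved fallback paths" for DynamoDB-backed flows. A minimal sketch of what such a path can look like, assuming boto3 and entirely hypothetical table, bucket, and function names (this is not Piano's implementation): a write that cannot reach DynamoDB is parked in a store in another region so it can be replayed once the primary recovers.

```python
import json

import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical resources: primary transaction-state table in us-east-1,
# fallback bucket in a different region for later replay.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
state_table = dynamodb.Table("transaction-state")
fallback_s3 = boto3.client("s3", region_name="us-west-2")
FALLBACK_BUCKET = "transaction-state-fallback"


def save_transaction_state(txn_id: str, state: dict) -> str:
    """Persist transaction state, falling back to a secondary store when
    DynamoDB in the primary region is returning errors."""
    try:
        state_table.put_item(Item={"txn_id": txn_id, **state})
        return "primary"
    except (ClientError, BotoCoreError):
        # Primary store unavailable: park the state where it can be
        # replayed into DynamoDB once the region recovers.
        fallback_s3.put_object(
            Bucket=FALLBACK_BUCKET,
            Key=f"pending/{txn_id}.json",
            Body=json.dumps(state).encode("utf-8"),
        )
        return "fallback"
```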
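
Another mitigation is simplifying critical email delivery when queues are unavailable. A minimal sketch of that idea, assuming boto3 with SQS for the normal asynchronous path and SES for a direct path; the queue URL, message fields, and function name are hypothetical and only illustrate the bypass, not Piano's actual mail pipeline.

```python
import json

import boto3
from botocore.exceptions import BotoCoreError, ClientError

sqs = boto3.client("sqs", region_name="us-east-1")
ses = boto3.client("ses", region_name="us-east-1")

# Hypothetical queue used for normal asynchronous email delivery.
EMAIL_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/transactional-email"


def deliver_critical_email(message: dict) -> str:
    """Enqueue an email for normal asynchronous delivery; if the queue is
    unavailable, send the critical email directly instead of waiting for
    queue recovery."""
    try:
        sqs.send_message(QueueUrl=EMAIL_QUEUE_URL, MessageBody=json.dumps(message))
        return "queued"
    except (ClientError, BotoCoreError):
        # Queue path is failing: send the message synchronously so the
        # customer-facing email is not delayed by queue recovery.
        ses.send_email(
            Source=message["from"],
            Destination={"ToAddresses": [message["to"]]},
            Message={
                "Subject": {"Data": message["subject"]},
                "Body": {"Text": {"Data": message["body"]}},
            },
        )
        return "sent-directly"
```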