Uploadcare experienced a major incident on October 20, 2025, lasting —. The incident has been resolved; the full update timeline is below.
Update timeline
- resolved Oct 20, 2025, 09:48 AM UTC
On October 20, 2025, between 06:49 UTC and 09:27 UTC, a portion of our URL API service experienced a major disruption. The service responsible for real-time processing and transformation of uncached files was unavailable, resulting in failed requests for those specific operations. The delivery of already-cached files and all other Uploadcare services, including file uploading and management, were unaffected.
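To make the scope concrete, the sketch below (a plain Python `requests` probe with a placeholder file UUID, not our monitoring code) shows the two kinds of requests involved: plain CDN delivery of an already-cached file, which remained available, and an on-the-fly URL API transformation, which is the type of uncached operation that was failing.

```python
# Illustrative probe only; the UUID below is a placeholder, and a real request
# would need an existing file. "-/resize/300x/" is a URL API transformation
# that has to be processed on the fly when no cached result exists.
import requests

CDN = "https://ucarecdn.com"
FILE_UUID = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

# Delivery of an already-cached file: unaffected during the incident.
cached = requests.get(f"{CDN}/{FILE_UUID}/", timeout=10)

# Transformation of an uncached file: the kind of request that was failing.
transformed = requests.get(f"{CDN}/{FILE_UUID}/-/resize/300x/", timeout=10)

print("delivery:", cached.status_code, "transformation:", transformed.status_code)
```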
- postmortem Oct 20, 2025, 09:53 AM UTC
On October 20, 2025, between 07:50 UTC and 09:27 UTC, a portion of our URL API service experienced a major disruption. The service responsible for real-time processing and transformation of uncached files was unavailable, resulting in failed requests for those specific operations. The delivery of already-cached files and all other Uploadcare services, including file uploading and management, were unaffected.

The direct root cause of this incident was a major service outage in the Amazon Web Services (AWS) `us-east-1` (N. Virginia) region, specifically impacting AWS DynamoDB. During the incident, we also identified a critical flaw in our incident communication process: we were unable to access our Atlassian-hosted status page to provide timely updates to our customers, because the status page provider was itself affected by the same AWS disruption. Our key follow-up actions will focus on improving the architectural resilience of our services to reduce single points of failure and on establishing a more robust, independent system for incident communications.

## Timeline of events

All times are in UTC on October 20, 2025.

* **06:49:** Our internal monitoring systems begin to alert on a significant increase in error rates and latency for the URL API service, specifically for requests involving uncached files. Our engineering team begins an immediate investigation.
* **07:03:** Our engineering team localizes the issue to DynamoDB.
* **07:37:** The team attempts to update our public status page (`status.uploadcare.com`) to inform customers but is unable to log in to the third-party provider (Atlassian Statuspage).
* **07:46:** We conclude that the issue is related to DNS resolution for DynamoDB.
* **07:51:** AWS posts its first notification acknowledging increased error rates and latencies for multiple services in the `us-east-1` region. Our team correlates our internal alerts with this broader AWS issue.
* **08:26:** AWS confirms that the issue is centered on significant error rates for DynamoDB in `us-east-1`, confirming our initial diagnosis of the root cause affecting our service.
* **09:01:** AWS reports they have identified a potential root cause related to DNS resolution for the DynamoDB API endpoint and are working on mitigations.
* **09:27:** Our internal monitoring shows that error rates for the URL API have returned to normal levels. The service is fully recovered and operational. We declare the incident resolved internally.

## What went well

* Our internal monitoring systems detected the service failure immediately, allowing for a rapid start to our investigation.
* Our engineering team was able to quickly correlate the internal issue with the external AWS service outage, preventing wasted time on internal diagnostics.

## What went wrong

* **Architectural single point of failure:** Our URL API for uncached files had a hard dependency on a single service (DynamoDB) within a single AWS region (`us-east-1`). The failure of this service led directly to the failure of ours, with no mechanism for graceful degradation or failover.
* **Failure of incident communication:** Our designated channel for customer communication, the status page hosted at `status.uploadcare.com`, was inaccessible to our team during the outage. This failure prevented us from providing timely and transparent updates to our users, which is a critical part of our incident response protocol.
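To illustrate the dependency that failed: the following is a minimal sketch, not our production code, and the table name, key schema, and timeouts in it are assumptions made for the example. When DNS resolution for the regional DynamoDB endpoint breaks, the AWS SDK raises a connection error instead of returning data, and without a fallback the URL API request that needs that lookup fails with it.

```python
# Minimal sketch of the failure mode; not our production code. The table name,
# key schema, and timeouts here are illustrative assumptions.
import boto3
from botocore.config import Config
from botocore.exceptions import EndpointConnectionError

dynamodb = boto3.resource(
    "dynamodb",
    region_name="us-east-1",
    config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
)
table = dynamodb.Table("file-metadata")  # hypothetical table name


def load_file_record(file_uuid: str) -> dict | None:
    try:
        return table.get_item(Key={"uuid": file_uuid}).get("Item")
    except EndpointConnectionError:
        # During the incident, the DynamoDB endpoint in us-east-1 could not be
        # resolved, so every lookup for an uncached file ended up here and the
        # URL API request failed with it: there was no degradation path.
        raise
```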
## Action items

* **Establish a redundant status communication channel:** We will set up a secondary, fully independent channel for incident communication. This will ensure that if our primary status page provider is ever unavailable, we can still communicate effectively with our customers.
* **Architect for resilience and decentralization:** Our engineering team will conduct a full architectural review of the URL API service. The primary goal is to re-architect the service to remove DynamoDB in `us-east-1` as a single point of failure. This may involve implementing multi-region failover capabilities or introducing a more resilient data caching layer; a rough sketch of this kind of failover path is included at the end of this postmortem.
* **Audit critical services for single points of failure:** We will expand our review to other critical services across the Uploadcare platform to identify and mitigate other potential single points of failure related to external, regional dependencies.

We sincerely apologize for the disruption this incident caused our customers and for our failure to communicate the issue in a timely manner. We are committed to learning from this event and implementing these changes to build a more resilient and reliable platform.
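As a rough illustration of the failover direction mentioned in the action items, the sketch below assumes a hypothetical replica of the metadata table in a second region (for example, a DynamoDB global table); all names, regions, and schemas are placeholders, and this is a direction we are evaluating rather than a finished design. The point is that a connection failure in the primary region degrades to a read from the replica instead of failing the request.

```python
# Illustrative failover sketch, assuming a hypothetical replica table
# (e.g. a DynamoDB global table) in a second region; names are placeholders.
import boto3
from botocore.config import Config
from botocore.exceptions import EndpointConnectionError

_CONFIG = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1})

PRIMARY = boto3.resource("dynamodb", region_name="us-east-1", config=_CONFIG)
REPLICA = boto3.resource("dynamodb", region_name="us-west-2", config=_CONFIG)


def get_file_record(file_uuid: str) -> dict | None:
    """Read file metadata, falling back to the replica region on failure."""
    for region in (PRIMARY, REPLICA):
        try:
            table = region.Table("file-metadata")  # hypothetical table name
            return table.get_item(Key={"uuid": file_uuid}).get("Item")
        except EndpointConnectionError:
            # Primary endpoint unreachable (as in this incident); try the next
            # region instead of failing the URL API request outright.
            continue
    return None  # both regions unreachable; caller can serve a cached result or an error
```

The fallback could just as well be a separate caching layer; either way, the goal is that a regional outage degrades the URL API rather than takes it down.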