PDF-Generator API incident
Service Degradation: Intermittent Failures in Document Generation
PDF-Generator API experienced a critical incident on August 13, 2025, affecting the Document Generation Service; the status-page incident was open for 38 minutes from identification to resolution. The incident has been resolved; the full update timeline is below.
Affected components
- Document Generation Service
Update timeline
- identified Aug 13, 2025, 05:15 AM UTC
Our team is currently addressing a service degradation impacting our document generation feature. We understand the importance of this service to your workflow, and our teams are working diligently to restore full functionality.

**What is happening?** Starting at approximately 12:00 AM UTC, we identified an issue causing intermittent failures for document generation requests. Our monitoring indicates that approximately 50% of these requests are currently failing. Our investigation has traced the root cause to performance issues within our Redis cache layer in our primary AWS Region.

**What are we doing?** Our DevOps and Engineering teams are actively working on a resolution. We are currently implementing remediation steps to restore cache stability and bring the service back to full health. We are confident in our team's ability to resolve this matter promptly.

We sincerely apologise for the disruption this has caused and appreciate your patience.
- monitoring Aug 13, 2025, 05:50 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved Aug 13, 2025, 05:53 AM UTC
This incident has been resolved.
- postmortem Aug 13, 2025, 08:25 AM UTC
### Summary

On August 13, 2025, between approximately 02:30 and 08:44 (UTC+3), the PDF Generator API service experienced a severe degradation. During this period, a significant percentage of API requests failed. The issue was resolved by failing over a component of our caching infrastructure, and the service has been stable since.

### Impact

* **Service:** PDF Generator API
* **Impact:** A high rate of failed API requests for all users.
* **Duration:** Approximately 6 hours and 15 minutes of severe degradation.

### Timeline of Events (UTC+3)

* **August 12, 22:00:** A low volume of connection errors to our caching layer was first detected by our monitoring systems. The overall service health remained stable, with no noticeable impact on users.
* **August 13, 02:30:** A sharp increase in connection failures occurred, leading to a severe degradation of the PDF Generator API service. The on-call engineering team was engaged.
* **August 13, 02:30 - 08:44:** The engineering team actively investigated the issue. During this time, the service continued to experience a high failure rate.
* **August 13, 08:44:** A successful manual failover of a key component in our caching infrastructure was completed. Service functionality was immediately restored.

### Root Cause Analysis

The direct cause of the incident was a widespread failure of our application servers to establish connections with our primary caching instance, which runs on Redis in an AWS cluster. Our investigation confirmed that the Redis server itself was healthy, responsive to direct connections, and processing data normally. All internal metrics for the server appeared normal throughout the incident. The issue was isolated to the network path between the application and the cache. Evidence, including a concurrent spike in reported issues for Amazon Web Services (AWS) on third-party monitoring sites such as Downdetector, suggests the incident may have been related to a transient issue within the underlying cloud infrastructure.

### Resolution

The immediate fix was to redirect all application traffic from the primary cache instance to a secondary hot-standby replica. This action successfully bypassed the connectivity problem, and full service functionality was restored upon completion of the failover.

### Action Items & Lessons Learned

To prevent recurrence and improve our response, we are taking the following actions:

* **Enhance Monitoring:** We are implementing more granular, application-level alerts to track connection success rates for our caching tier. This will allow for faster detection of this specific failure scenario (a minimal probe sketch is included after this postmortem).
* **Automate Failover:** We will develop and test an automated failover procedure for the caching tier. This will significantly reduce our recovery time objective (RTO) if a similar incident occurs in the future (a sketch of the decision logic also follows below).
* **Review Incident Protocols:** We will review and update our incident response protocols to formalise the process for investigating and escalating potential issues with our cloud provider.

We apologise for the disruption this incident caused our users. We are committed to learning from this event and improving the reliability of our services.
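
To make the monitoring action item above concrete, the following is a minimal sketch of what an application-level connection probe for the caching tier could look like. It assumes a Python service using the redis-py client; the endpoint name `cache-primary.internal`, the probe count, and the metric handling are illustrative assumptions, not a description of our production code. The key idea is that a `PING` issued over a fresh connection exercises the full network path, so a falling success rate flags exactly the failure mode seen in this incident even while server-side metrics look healthy.

```python
# Hypothetical sketch: an application-level probe that distinguishes
# "cache server unhealthy" from "network path to the cache is broken",
# and reports a connection success rate that alerting can act on.
# Assumes a Python service using the redis-py client; CACHE_HOST and
# the thresholds are illustrative, not our real configuration.

import time
import redis

CACHE_HOST = "cache-primary.internal"   # assumed endpoint name
CACHE_PORT = 6379
PROBE_COUNT = 20


def probe_cache(host: str, port: int) -> float:
    """Open PROBE_COUNT short-lived connections and return the success rate."""
    successes = 0
    for _ in range(PROBE_COUNT):
        client = redis.Redis(
            host=host,
            port=port,
            socket_connect_timeout=1.0,  # fail fast on a dead network path
            socket_timeout=1.0,
        )
        try:
            client.ping()                # round-trips through the full path
            successes += 1
        except (redis.exceptions.ConnectionError,
                redis.exceptions.TimeoutError):
            pass                         # connection-level failure, count it
        finally:
            client.close()
        time.sleep(0.1)
    return successes / PROBE_COUNT


if __name__ == "__main__":
    rate = probe_cache(CACHE_HOST, CACHE_PORT)
    # In production this value would feed an alert (e.g. page if below 90%
    # for five minutes); here we simply print it.
    print(f"cache connection success rate: {rate:.0%}")
```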
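
For the automated-failover item, here is a small sketch of the decision logic under the same assumptions (redis-py, illustrative endpoint names). It is not the mechanism used during the incident, which was a manual failover; it only illustrates how a reachability check could drive an automatic switch to the hot-standby replica.

```python
# Hypothetical sketch of the automated-failover action item: if the primary
# cache endpoint stops accepting connections while the hot-standby replica
# remains reachable, repoint application traffic at the standby. Endpoint
# names are assumptions for illustration; a production version would also
# need locking, flap protection, and operator notification.

import redis

PRIMARY = ("cache-primary.internal", 6379)   # assumed primary endpoint
STANDBY = ("cache-standby.internal", 6379)   # assumed hot-standby replica

active_endpoint = PRIMARY


def is_reachable(host: str, port: int) -> bool:
    """Return True if a fresh connection can PING the cache within one second."""
    client = redis.Redis(host=host, port=port,
                         socket_connect_timeout=1.0, socket_timeout=1.0)
    try:
        return bool(client.ping())
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
        return False
    finally:
        client.close()


def maybe_failover() -> tuple:
    """Switch the active endpoint when the primary is down but the standby is up."""
    global active_endpoint
    if active_endpoint == PRIMARY and not is_reachable(*PRIMARY):
        if is_reachable(*STANDBY):
            # Same effect as the manual failover performed in this incident,
            # but triggered automatically by the reachability check.
            active_endpoint = STANDBY
    return active_endpoint
```

In practice, the switch would more likely happen at the DNS or service-discovery layer rather than inside each application process, so that all instances fail over consistently.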