The PDF-Generator API experienced a critical incident on June 21, 2024, affecting the Document Generation Service and lasting 32 minutes. The incident has been resolved; the full update timeline is below.
Affected components
- Document Generation Service
Update timeline
- investigating Jun 21, 2024, 05:14 AM UTC
Our document generation processes are taking longer than usual. We have identified the root cause and are working on a fix.
- monitoring Jun 21, 2024, 05:31 AM UTC
A fix has been implemented and we are monitoring the results.
- resolved Jun 21, 2024, 05:46 AM UTC
This incident has been resolved.
- postmortem Jun 21, 2024, 06:52 AM UTC
Today, we experienced our longest downtime in the ten years we have provided our document generation service. We apologize to all of our customers who were affected by the downtime. The saddest part of this incident is that it could have been avoided. We are improving our processes so that these mistakes are not repeated.

**CHRONOLOGY OF THE INCIDENT**

* Logged to Slack monitoring channel: Jun 20, 2024, 06:17 PM UTC
* Reports from customers via Crisp: Jun 21, 2024, 01:00 AM UTC
* Incident seen by a technical team member: Jun 21, 2024, 08:00 AM UTC
* Issue resolved: Jun 21, 2024, 08:36 AM UTC
* Issue closed: Jun 21, 2024, 08:40 AM UTC

**IMPACT OF THE INCIDENT**

The downtime of the Cloud US Generator API affected all Cloud Service users; no customer could generate documents while the incident was ongoing.

**THE ROOT CAUSE**

The root cause of the incident was a misconfiguration of nginx. We synced configuration changes in ArgoCD on Jun 20, 2024, but did not restart all of the pods. When new pods were added or existing ones were restarted, they failed to come up, which caused the downtime. The existing alerting system logged the errors to our Slack monitoring channel, but the health check still considered the service healthy and therefore never sent SMS notifications to the technical team.

**LESSONS LEARNED**

* After syncing configuration changes, we must restart all pods and validate that everything works as expected. This alone would have prevented the incident (a rollout-verification sketch follows below).
* The US-based support team should call the European-based technical team when they confirm an incident outside the EU team's working hours. This would have let us act much faster and resolve the incident sooner.
* All health checks need to cover the sub-systems and microservices they depend on. This would have alerted the technical team far sooner and shortened the downtime (see the health-check sketch below).
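As a concrete illustration of the first lesson, the sketch below restarts the affected workloads after a config sync and blocks until each rollout succeeds, so a pod that fails to come up is caught during the change window rather than hours later. It is a minimal sketch assuming a standard `kubectl` setup; the namespace and deployment names (`pdf-generator`, `nginx`, `generator-api`) are hypothetical placeholders, not our actual resource names.

```python
#!/usr/bin/env python3
"""Restart and verify workloads after a config sync (illustrative sketch)."""
import subprocess
import sys

NAMESPACE = "pdf-generator"               # hypothetical namespace
DEPLOYMENTS = ["nginx", "generator-api"]  # hypothetical deployment names


def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def main() -> int:
    for name in DEPLOYMENTS:
        # Trigger a rolling restart so every pod picks up the synced config.
        run(["kubectl", "-n", NAMESPACE, "rollout", "restart",
             f"deployment/{name}"])
        # Block until the rollout finishes; a non-zero exit means the new
        # pods failed to come up, surfacing the misconfiguration immediately.
        run(["kubectl", "-n", NAMESPACE, "rollout", "status",
             f"deployment/{name}", "--timeout=5m"])
    return 0


if __name__ == "__main__":
    try:
        sys.exit(main())
    except subprocess.CalledProcessError as exc:
        print(f"rollout failed: {exc}", file=sys.stderr)
        sys.exit(1)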
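For the third lesson, here is a minimal sketch of an aggregated health endpoint: it probes each dependent sub-system and returns 503 when any of them fails, so an external uptime monitor can page the on-call team instead of the failure landing only in a Slack channel. The dependency names, ports, and `/ping` paths are hypothetical assumptions for illustration.

```python
#!/usr/bin/env python3
"""Aggregated health check sketch: unhealthy if any dependency fails."""
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sub-system endpoints the service depends on.
DEPENDENCIES = {
    "nginx": "http://127.0.0.1:8080/ping",
    "generator": "http://127.0.0.1:9000/ping",
}


def check(url: str) -> bool:
    """Return True only if the dependency answers 200 within 2 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        results = {name: check(url) for name, url in DEPENDENCIES.items()}
        healthy = all(results.values())
        # A 503 makes the failure visible to uptime monitors, so SMS/paging
        # alerts fire instead of the error only being logged to Slack.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(results).encode())


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), Health).serve_forever()
```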