DRACOON incident

Partial Outage of DRACOON Cloud

DRACOON experienced a major incident on March 31, 2025, affecting API groups 01 through 09 and lasting 3h 26m. The incident has been resolved; the full update timeline is below.

Started: Mar 31, 2025, 05:30 AM UTC
Resolved: Mar 31, 2025, 08:56 AM UTC
Duration: 3h 26m
Detected by Pingoru: Mar 31, 2025, 05:30 AM UTC

Affected components

API (group 01)
API (group 02)
API (group 03)
API (group 04)
API (group 05)
API (group 06)
API (group 07)
API (group 08)
API (group 09)

Update timeline

investigating Mar 31, 2025, 06:06 AM UTC

We are currently investigating an issue with DRACOON Cloud. Our team is working to gather more information and resolve the issue as quickly as possible. We apologize for any inconvenience this may cause and will provide updates as soon as we have them.

monitoring Mar 31, 2025, 06:11 AM UTC

The issue with DRACOON Cloud has been resolved, and we are monitoring the situation to ensure it remains stable. We apologize for any inconvenience this may have caused and appreciate your patience.

resolved Mar 31, 2025, 08:56 AM UTC

The issue with DRACOON Cloud has been fully resolved. All systems are now operating normally. We apologize for any inconvenience this may have caused and appreciate your patience. If you continue to experience any issues, please don't hesitate to reach out to our support team for assistance.

postmortem Sep 09, 2025, 03:18 PM UTC

We experienced an issue with DRACOON Cloud on 2025-03-31 from around 07:30 to 08:15. Our team has worked diligently to identify the root cause and implement a resolution. In this post-mortem, we want to share the details of what happened, why it happened, what we did to resolve it, and what we will do to prevent similar incidents in the future.

What happened?

DRACOON Cloud experienced performance degradation during early usage hours, affecting user access and normal operation.

Why did this happen?

Application containers hit memory limits during high traffic periods, causing automatic restarts and service interruptions as the container orchestration system cycled through unhealthy instances. The memory limits were set too conservatively and hadn't been updated to account for certain traffic spikes.

What did we do?

Our engineering team quickly identified the container restart pattern through application logs and monitoring dashboards. We immediately increased the memory limits for affected services and scaled up the number of container replicas to distribute the load.

What can we do to improve?

We will improve our monitoring, update memory limits based on actual usage patterns, and create automated scaling policies that proactively increase resources before hitting limits.

We apologize for any inconvenience this incident may have caused. We are committed to ensuring the stability and reliability of our services and will continue to take proactive measures to prevent similar incidents from happening in the future. If you have any questions or concerns, please don't hesitate to reach out to our support team for assistance.
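The two improvements named in the post-mortem, memory limits derived from actual usage and scaling out before limits are hit, can be illustrated with a minimal Python sketch. This is not DRACOON's tooling: the function names, the 30% headroom, the 70% target utilization, the 10% tolerance, and the replica cap are all hypothetical values chosen for illustration. The scale-out rule is a simple proportional one, similar in spirit to the formula used by common container autoscalers.

```python
import math


def recommended_memory_limit_mib(usage_samples_mib, headroom_percent=30):
    """Suggest a memory limit: observed peak usage plus a safety margin.

    headroom_percent is a hypothetical default, not a value from the report.
    """
    if not usage_samples_mib:
        raise ValueError("need at least one usage sample")
    peak = max(usage_samples_mib)
    # Integer arithmetic before the division avoids float rounding surprises.
    return math.ceil(peak * (100 + headroom_percent) / 100)


def desired_replicas(current_replicas, avg_usage_mib, limit_mib,
                     target_utilization=0.7, tolerance=0.1, max_replicas=20):
    """Scale out proportionally when average memory use exceeds the target.

    This sketch only scales out; it never returns fewer replicas than are
    currently running. All thresholds are illustrative assumptions.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if limit_mib <= 0:
        raise ValueError("limit_mib must be positive")

    utilization = avg_usage_mib / limit_mib
    ratio = utilization / target_utilization

    # Within the tolerance band around the target: leave the count unchanged.
    if ratio <= 1 + tolerance:
        return current_replicas

    wanted = math.ceil(current_replicas * ratio)
    # Respect the cap, but never drop below what is already running.
    return max(current_replicas, min(wanted, max_replicas))


if __name__ == "__main__":
    samples = [400, 650, 512]  # hypothetical per-container usage in MiB
    print("suggested limit (MiB):", recommended_memory_limit_mib(samples))
    print("replicas at 90% of limit:", desired_replicas(4, 900, 1000))
```

The point of the design is that scale-out triggers at a utilization target well below the hard limit, so extra replicas are added while containers still have room, rather than after the orchestrator has started restarting them.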