Production Service Degradation
Timeline · 4 updates
- investigating Apr 27, 2026, 10:27 AM UTC
We are currently experiencing a service disruption in our production environment. Users can log in but may encounter an error page immediately after authentication. Our team has identified a potential root cause and is actively working on a fix. We will provide our next update by 11:00 UTC. We apologize for the inconvenience and appreciate your patience.
- monitoring Apr 27, 2026, 10:43 AM UTC
The root cause of the issue has been identified and corrective changes have been applied. The system should now be operating normally again. We will continue to monitor closely to ensure stability. Thank you for your patience while we work to fully resolve this incident.
- resolved Apr 27, 2026, 01:38 PM UTC
Since applying the mitigation measures, we have observed no further issues and system performance has remained stable. This incident is now resolved.
- postmortem Apr 28, 2026, 01:03 PM UTC
### Summary

On April 27, 2026, customers experienced a service disruption that prevented them from accessing their projects within Archlet. Between 11:52 and 12:08 CEST, customers experienced occasional failed requests. Between 12:08 and 12:25 CEST, all customers were affected: while authentication remained functional, attempting to access any project returned an error page. From 12:25 CEST onward, the platform was fully operational and customers were no longer impacted.

No customer data was lost or compromised at any point, and no data integrity issues resulted from this incident.

Our automated monitoring detected anomalous resource consumption at 11:46 CEST, and at 11:53 CEST an internal report of customer-facing impact escalated the issue to an active incident response.

### What happened

A single complex optimization workload, processed by our engine service, consumed significantly more compute and memory resources than typical workloads of its kind. Because of how the workload was reprocessed after its initial failures, the resulting resource pressure propagated across multiple parts of our production environment in sequence, eventually placing the entire system under critical load. As a result, dependent services became unable to serve customer requests. A simplified illustration of this retry-amplification pattern is included in the appendix below.

### Resolution

Our engineering team triaged the incident in parallel workstreams. Affected workloads were rebalanced onto fresh capacity, and as an additional precaution, traffic was migrated to a standby production cluster. Customer-facing functionality was fully restored at 12:25 CEST. The team continued to monitor the platform closely and applied further configuration hardening throughout the afternoon, formally closing the incident at 15:33 CEST after a sustained monitoring period.

### What we are doing about it

While we detected and responded to the incident quickly, we recognize that the disruption should not have reached customers in the first place. We have identified the following improvements:

* **Resource governance:** We are introducing stricter resource boundaries on the components involved, with safeguards designed to prevent resource pressure from propagating across our infrastructure (see the appendix for an illustrative sketch).
* **Workload isolation:** We are introducing stronger isolation between customer-facing computational workloads and platform-critical services.
* **Improved observability:** We are enhancing our internal dashboards and alerting heuristics to more clearly distinguish between transient anomalies and incidents requiring immediate intervention, reducing time-to-action for similar signal patterns in the future (also sketched in the appendix).

The majority of these measures are already in production, with the remainder rolling out within the coming days.

### Closing

Reliability is foundational to the work our customers do on Archlet, particularly around time-sensitive sourcing decisions. We are sorry for the disruption this caused on April 27, and we thank our customers for their patience during the incident. We remain committed to transparent communication whenever our service falls short of expectations.

If you have any questions about this incident or its impact on your account, please reach out to your customer success contact or our support team.
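### Appendix: Illustrative sketches

The sketch below illustrates the retry-amplification pattern described under "What happened": a workload that fails and is reprocessed without a bound keeps adding load on every attempt. This is a minimal Python illustration under assumed names, not our engine code; `process_workload`, the retry limits, and the delays are all hypothetical.

```python
import random
import time

MAX_ATTEMPTS = 3    # hypothetical cap; unbounded reprocessing is what amplifies load
BASE_DELAY_S = 2.0  # hypothetical initial backoff
MAX_DELAY_S = 60.0  # ceiling so backoff delays stay bounded


class WorkloadFailed(Exception):
    """Raised when a single processing attempt fails."""


def process_workload(workload_id: str) -> None:
    """Hypothetical stand-in for one optimization run; always fails here
    to mimic a workload that exceeds its resource budget."""
    raise WorkloadFailed(f"workload {workload_id} exceeded its resource budget")


def run_with_bounded_retries(workload_id: str) -> bool:
    """Retry a failed workload a fixed number of times, then park it.

    Without a cap, a pathological workload is re-enqueued indefinitely,
    and each attempt adds resource pressure to the system.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process_workload(workload_id)
            return True
        except WorkloadFailed as exc:
            if attempt == MAX_ATTEMPTS:
                # Dead-letter the workload for manual inspection instead of retrying.
                print(f"parking {workload_id} after {attempt} failed attempts: {exc}")
                return False
            # Exponential backoff with jitter spreads the remaining retries out in time.
            delay = min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)
            time.sleep(delay * random.uniform(0.5, 1.5))
    return False


if __name__ == "__main__":
    run_with_bounded_retries("opt-run-example")
```

The key property is the hard cap: after `MAX_ATTEMPTS`, the workload is parked rather than re-enqueued, so one failing job cannot keep consuming capacity.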
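One concrete form of resource governance is a hard per-workload memory ceiling. The following sketch assumes a POSIX host, where the standard-library `resource.setrlimit` call can cap a worker process's address space; the 2 GiB figure and the worker structure are assumptions for illustration, not a description of our production safeguards, and enforcement of `RLIMIT_AS` varies by operating system.

```python
import sys
from multiprocessing import Process

import resource  # POSIX-only standard-library module

MEMORY_LIMIT_BYTES = 2 * 1024**3  # hypothetical 2 GiB ceiling per workload


def limited_worker() -> None:
    """Run one workload under a hard address-space cap.

    If the workload allocates past the cap, it fails with MemoryError inside
    this process instead of pressuring the whole host.
    """
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT_BYTES, MEMORY_LIMIT_BYTES))
    try:
        _buffer = bytearray(4 * 1024**3)  # deliberately allocate past the cap
    except MemoryError:
        print("workload hit its memory ceiling and was contained")
        sys.exit(1)
    sys.exit(0)


if __name__ == "__main__":
    worker = Process(target=limited_worker)
    worker.start()
    worker.join()
    print(f"worker exited with code {worker.exitcode}; host process unaffected")
```

Running each workload in its own process also gives the isolation property mentioned above: a contained failure ends one worker, not the services sharing the host.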
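Finally, the observability improvement of separating transient anomalies from sustained incidents can be expressed as a simple heuristic: only flag an incident when a metric stays above its threshold for several consecutive samples. This is a generic sketch of that idea; the threshold and window values are illustrative, not our actual alerting configuration.

```python
from collections import deque


class SustainedAnomalyDetector:
    """Flag an incident only when a metric stays above its threshold for
    `window` consecutive samples, so one-off spikes are ignored."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.samples: deque[float] = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one sample; return True once the anomaly is sustained."""
        self.samples.append(value)
        return (
            len(self.samples) == self.samples.maxlen
            and all(v > self.threshold for v in self.samples)
        )


# Example: 90% CPU threshold, sustained across 5 consecutive samples.
detector = SustainedAnomalyDetector(threshold=0.9, window=5)
for cpu in [0.95, 0.4, 0.92, 0.93, 0.94, 0.95, 0.96]:
    if detector.observe(cpu):
        print(f"sustained anomaly at sample {cpu}")
```

Most monitoring stacks express the same idea declaratively, for example as a minimum duration a condition must hold before an alert fires.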