Commerce Layer incident

API/Dashboard/Applications back to normal

Commerce Layer experienced a critical incident on April 9, 2024, affecting the Commerce API, Checkout, and Dashboard components and lasting 3h 48m. The incident has been resolved; the full update timeline is below.

Started: Apr 09, 2024, 07:25 AM UTC
Resolved: Apr 09, 2024, 11:14 AM UTC
Duration: 3h 48m
Detected by Pingoru: Apr 09, 2024, 07:25 AM UTC

Affected components

Commerce API, Checkout, Dashboard

Update timeline

investigating Apr 09, 2024, 07:25 AM UTC

We are aware that our API/Dashboard/Applications are currently not available. We are investigating it with the highest priority and will provide updates here.
investigating Apr 09, 2024, 07:26 AM UTC

We are continuing to investigate this issue.
identified Apr 09, 2024, 08:36 AM UTC

We are working with our service provider to resolve the issue and restore normal operations.
monitoring Apr 09, 2024, 09:06 AM UTC

A fix has been applied. We are monitoring the situation and will publish a detailed incident report.
resolved Apr 09, 2024, 11:14 AM UTC

The issue is resolved and the incident is closed. We're working on the root cause analysis; a post-mortem report will be available soon.
postmortem Apr 22, 2024, 02:03 PM UTC

# **Summary**

On April 9th, from 6:45 AM to 8:55 AM UTC, during the scheduled maintenance window, we encountered a database issue. The database version upgrade process failed and the automatic recovery service provided by our cloud infrastructure provider did not activate. We immediately contacted the infrastructure provider’s support team. They addressed the problem and there was no resulting data loss.

# **Leadup**

On April 9th, 2024, we began the DB upgrade maintenance at 6:39 AM UTC. At approximately 6:45 AM UTC, the background task provided by the DB cloud service failed, leaving the entire upgrade procedure stuck in an intermediate state.

# **Fault**

During the incident, all Core API calls resulted in errors. These errors were spread across all endpoints and customers, depending on their traffic distribution and type. The read requests cached on our CDN continued to work without errors.

# **Detection**

A few minutes after beginning maintenance, our engineers realized that the automatic upgrade process was not responding. The cloud service provides a safeguard during upgrades that automatically restores normal operations if the upgrade exceeds a 5-minute timeout. This expected action did not occur.

# **Root Cause**

The service provider's support team confirmed that the upgrade process, once initiated, encountered an unexpected issue within their internal procedure, causing the upgrade to stall. The automatic recovery safeguard began as expected five minutes after the incident. However, it couldn't complete because one of the read replicas was in an inconsistent state. The vendor is still investigating the cause of this issue. Since the database service is fully managed by the provider, our engineers had to wait for the provider's support team to restore the normal level of availability.

# **Mitigation and resolution**

A few minutes before 7:00 AM UTC, our engineers tried to manually interrupt the process, but the option to cancel it was disabled by the vendor. Our team immediately engaged the service provider’s support team, requesting with the highest urgency that they interrupt the process and restore the availability of the DB. The provider’s support team started investigating with their own tools and after a few minutes suggested a couple of workarounds, which didn’t succeed. In parallel, we began our own restoration process from a backup. At about 8:50 AM UTC, the service provider’s support team was able to restore DB availability and our services gradually recovered to normal operation levels. At approximately 8:55 AM UTC, the entire resolution was completed.

# **Corrective and Preventative Measures**

Given the nature of the cloud infrastructure on which our services are built, not all operational steps are under our full control. However, we identified improvements to processes and procedures that we intend to implement together with our service provider:

* We will directly involve the service provider’s representatives in future maintenance operations, from planning through implementation.
* We will update our status page as soon as an issue occurs. We will extend this by sending a proactive alert to organization owners.
* We will schedule the maintenance window during a low-traffic period on our platform.
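
The safeguard behaviour described under Detection and Root Cause can be illustrated with a small state model. This is only a sketch under stated assumptions, not the provider's implementation: the names `simulate_upgrade`, `Outcome`, and `needs_manual_escalation` are hypothetical, and the only figure taken from the report is the 5-minute timeout.

```python
from enum import Enum

ROLLBACK_TIMEOUT_S = 5 * 60  # 5-minute safeguard timeout stated in the report


class Outcome(Enum):
    UPGRADED = "upgraded"        # upgrade finished within the timeout
    ROLLED_BACK = "rolled_back"  # safeguard restored the previous state
    STUCK = "stuck"              # safeguard started but could not complete


def simulate_upgrade(upgrade_duration_s: float, replicas_consistent: bool) -> Outcome:
    """Hypothetical model of an upgrade guarded by a timeout-based rollback."""
    if upgrade_duration_s <= ROLLBACK_TIMEOUT_S:
        return Outcome.UPGRADED
    # Timeout exceeded: the safeguard starts, but (as assumed here, following
    # the report) it can only finish if every read replica is consistent.
    if replicas_consistent:
        return Outcome.ROLLED_BACK
    return Outcome.STUCK


def needs_manual_escalation(outcome: Outcome) -> bool:
    """A stuck upgrade cannot recover by itself and needs human intervention."""
    return outcome is Outcome.STUCK
```

In this model, the incident corresponds to a stalled upgrade combined with an inconsistent replica, for example `simulate_upgrade(float("inf"), replicas_consistent=False)`, which is the one case where neither the upgrade nor the rollback can complete and escalation to the provider is required.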