Commerce Layer incident

Underline infrastructure issue impacting live services

Commerce Layer experienced a major incident on January 17, 2023 affecting Commerce API and Dashboard, lasting 42m. The incident has been resolved; the full update timeline is below.

Started: Jan 17, 2023, 04:02 PM UTC
Resolved: Jan 17, 2023, 04:45 PM UTC
Duration: 42m
Detected by Pingoru: Jan 17, 2023, 04:02 PM UTC

Affected components

Commerce APIDashboard

Update timeline

investigating Jan 17, 2023, 05:08 PM UTC

An issue generated on one of our backend cloud infrastructure components impacted our Core API and Dashboard environment. We are currently working to restore normal service functionality.
identified Jan 17, 2023, 05:11 PM UTC

The issue has been identified, a fix is currently being implemented on the backend infrastructure.
monitoring Jan 17, 2023, 05:13 PM UTC

Fix has been deployed and services' behavior is going back to normal. We are keeping the status monitored until complete recovery from the issue.
resolved Jan 17, 2023, 05:17 PM UTC

The issue can be considered as completely resolved.
postmortem Jan 19, 2023, 08:49 AM UTC

# Summary Some users experienced HTTP 500 errors in response to Commence Layer API calls on January 17 between 4:02PM and 4:30PM UTC. A maintenance activity on one of our databases has triggered the issue. Our monitoring tool detected the event and alerts have been sent promptly to our internal communication system. We fixed the issue by closing stalled DB connections and re-starting the API backend processes that were affected. # Leadup A maintenance activity was performed on one of our databases around 4:01PM UTC on January 17, 2023. It was completed without any errors in a few seconds, but shortly thereafter HTTP 500 errors began to appear in API backend logs and alerts began to appear on our internal communication system. # Fault There were a number of API calls that returned errors. Upon investigation, the symptoms were caused by a backend runtime environment that was unable to connect to the database through connection pooling. # Detection Our real-time monitoring system detected the first error at 4:01PM UTC and immediately sent alerts to our internal communication channels. By 4:02PM UTC, our System Administrators were already investigating the issue. # Root Cause We have completed the investigation with our cloud service provider. We have identified an inconsistency in the db connection pooling service that may happen during some special backup tasks. We have now in place a different way to connect to the DB which will avoid any problems in the future during similar operations. # Mitigation and resolution Between 4:05PM UTC 4:20PM UTC the Team attempted a few restarts of the affected runtime environment. This ended up with a partial mitigation of the issue. After that, they realized they needed to take a more aggressive approach, which led to a cold restart of the runtime environment and killing all database processes. By 4:30PM UTC all API calls stopped receiving errors and after another 15 minutes the Team deemed the incident as resolved. # Corrective and Preventative Measures During the post-mortem analysis we found and tested a procedure that allows us to repeat the same DB activity without any errors.