Wakam incident

Major Outage

Critical, Resolved

Wakam experienced a critical incident on November 16, 2023 affecting the Pricing API and Developer Portal, lasting 2h 54m. The incident has been resolved; the full update timeline is below.

Started
Nov 16, 2023, 08:36 AM UTC
Resolved
Nov 16, 2023, 11:30 AM UTC
Duration
2h 54m
Detected by Pingoru
Nov 16, 2023, 08:36 AM UTC

Affected components

Pricing API, Developer Portal

Update timeline

  1. investigating Nov 16, 2023, 08:35 AM UTC

    We have received reports that some of our partners are unable to access our API. We are investigating further and will provide updates via this channel. Next update to come by 10:00.

  2. investigating Nov 16, 2023, 08:36 AM UTC

    We have received reports that some of our partners are unable to access our API. We are investigating further and will provide updates via this channel. Next update to come by 10:00.

  3. investigating Nov 16, 2023, 09:01 AM UTC

    We have identified an issue with a key piece of equipment that provides access to our pricing API platform. The issue prevents some of our partners from generating quotes and from accessing our developer portal. We are investigating together with our service provider, Microsoft Azure.

  4. investigating Nov 16, 2023, 09:39 AM UTC

    We are developing alternative migration strategies to provide a workaround. We will provide further updates.

  5. identified Nov 16, 2023, 10:18 AM UTC

    We have identified a potential root cause for this issue in the API Manager service, and our cloud provider, Microsoft Azure, is rolling back a recent deployment to mitigate it. An update will be provided in 30 minutes or as events warrant.

  6. monitoring Nov 16, 2023, 11:11 AM UTC

    It is confirmed that the outage was global and affected all Microsoft Azure clients, including Wakam. Microsoft Azure has rolled back our API Management service to mitigate the issue. Service is progressively returning to normal and we are still actively monitoring it. More details on the issue will be provided later.

  7. resolved Nov 16, 2023, 11:30 AM UTC

    We confirm the service is now back to normal. A detailed resolution statement will be provided in the coming days.

  8. postmortem Dec 12, 2023, 09:30 AM UTC

    # What happened?

    Between 03:15 UTC and 10:30 UTC on 16 Nov 2023, many partners experienced Transport Layer Security (TLS) exceptions when using our APIs, such as “The request was aborted: Could not create SSL/TLS secure channel”.

    # What went wrong?

    Our cloud provider, Microsoft Azure, faced a global issue with one of the services we use in our infrastructure: API Management. This issue impacted the following regions: North Europe, West Europe and West US. Microsoft determined that a recent deployment on the service introduced a configuration bug that caused failures when trying to create client connections. This configuration prevented TLS interactions from succeeding, returning exceptions for a subset of users.

    # How did we respond?

    The issue was not reported until 07:32 UTC because our internal probes were not impacted by it. After completing checks on our own infrastructure, we contacted Microsoft support at 08:37 UTC, and they confirmed the issue at 10:13 UTC. They rolled back the recent deployment of the API Management service to mitigate the issue. At 10:30 UTC the service was up again.

    # What happens next?

    After this outage, two actions have been decided:

    * Put in place new monitoring in addition to the current external probes
    * Improve the operational procedure to mitigate the impact in this kind of context

    NB: the official issue summary published by Microsoft Azure is available at [https://app.azure.com/h/6LSL-JCG/b34b8a](https://app.azure.com/h/6LSL-JCG/b34b8a).
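
The postmortem notes that internal probes did not see the TLS failures and lists additional monitoring as a follow-up. As an illustration only, the sketch below shows one way an external probe could attempt a full TLS handshake and record the stage at which it fails. It is not Wakam's monitoring: the names `ProbeResult` and `tls_probe` and the host `api.example.com` are hypothetical, and the timeout is an assumed default.

```python
import socket
import ssl
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    ok: bool                # True only if the TLS handshake completed
    stage: str              # "connect", "handshake", or "done"
    error: Optional[str]    # exception type and message when ok is False
    elapsed_s: float        # wall-clock time spent on the attempt


def tls_probe(host: str, port: int = 443, timeout: float = 5.0) -> ProbeResult:
    """Open a TCP connection, then attempt a TLS handshake (hypothetical helper)."""
    context = ssl.create_default_context()  # verifies certificate and hostname
    start = time.monotonic()
    stage = "connect"
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            stage = "handshake"
            # wrap_socket performs the handshake; a failure at this stage is
            # the client-side symptom described in the postmortem.
            with context.wrap_socket(raw, server_hostname=host):
                stage = "done"
        return ProbeResult(True, stage, None, time.monotonic() - start)
    except OSError as exc:
        # ssl.SSLError, timeouts, DNS errors and refused connections
        # are all subclasses of OSError.
        message = f"{type(exc).__name__}: {exc}"
        return ProbeResult(False, stage, message, time.monotonic() - start)


# Example call (assumed host name, not a real Wakam endpoint):
# result = tls_probe("api.example.com")
# if not result.ok:
#     print(f"TLS probe failed during {result.stage}: {result.error}")
```

Reporting the failing stage separately from the error text lets an alert distinguish a network-level outage ("connect") from a handshake failure behind a reachable gateway ("handshake"), which is the distinction that mattered in this incident.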