Voucherify experienced a minor incident on August 20, 2021 affecting AS1 - API and AS1 - Dashboard, lasting 3h 28m. The incident has been resolved; the full update timeline is below.
Affected components
- AS1 - API
- AS1 - Dashboard
Update timeline
- investigating Aug 20, 2021, 12:38 PM UTC
We are currently investigating the issue. We have noticed elevated response times for some API methods. The problem affects only the AS1 environment, which is hosted in the Singapore AWS region.
- monitoring Aug 20, 2021, 01:59 PM UTC
We are currently monitoring the platform to confirm that response times have stabilized.
- resolved Aug 20, 2021, 04:06 PM UTC
This incident has been resolved. The SRE team has analyzed the root cause and planned improvements; our optimistic estimate is that further improvements will be released next week.
- postmortem Aug 29, 2021, 06:10 PM UTC
We want to share more details with our customers and readers about the outage that occurred on the 20th of August 2021 and what we are doing to prevent it from happening again.

**Incident**

On August 20th, at 8:12 UTC, our systems detected increased request latency and a total outage of one of our major nodes, which is responsible for processing the validation, redemption, and publication API methods. The node manages the customer, order, voucher, and redemption entities and all operations related to them. This incident affected only tenants using the **AS1 cluster (Singapore, Asia)**.

**Impact on our customers**

We saw a large increase in 50x errors. Specifically, the 503 HTTP error indicates that our servers are unavailable. In this case, replicas of one of our services went down and did not recover properly after an automatic restart. As a result, **the API hosted on the AS1 cluster** ended up in a loop of restarting pods. That caused increased latency and request timeouts, which ultimately resulted in reduced availability for part of the API calls. A number of redemption API requests did not go through because of this issue.

**Source of the problem**

One of our customers with an account hosted on the AS1 cluster suddenly started sending a massive number of API calls. The volume exceeded the limit allocated to their account, and the rate limiter implemented in our API gateway did not act properly. The customer was repeatedly invoking the same API method. Unfortunately, for the path used by that customer, we had a bug in the API gateway (the service responsible for, among other things, authentication and rate limiting), resulting in a memory leak in the gateway. At the same time, our auto-scaling mechanism reacted too late and did not provision additional resources as quickly as required.
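Voucherify has not published its gateway implementation, so as a purely illustrative sketch, a per-account rate limiter of the kind described above is often built as a token bucket: each account refills tokens at its allocated rate, and requests that find the bucket empty are rejected at the gateway instead of reaching backend services. All names and limit values below are hypothetical.

```python
import time

class TokenBucket:
    """Hypothetical per-account bucket: refills `rate` tokens/sec, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # gateway would answer 429 Too Many Requests here

# One bucket per account, so one abusive integration cannot
# exhaust gateway resources shared with other tenants.
buckets: dict[str, TokenBucket] = {}

def check(account_id: str, rate: float = 100.0, capacity: int = 200) -> bool:
    bucket = buckets.setdefault(account_id, TokenBucket(rate, capacity))
    return bucket.allow()
```

The key property for an incident like this one is that rejection happens before authentication-heavy or stateful processing, so excess traffic is shed cheaply rather than accumulating state (and, as in this incident, leaking memory) inside the gateway.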
**Improvements**

First, we manually scaled up resources to improve the responsiveness of our API gateway, which mitigated the main problem. However, that was only a temporary measure; our primary goal was to fix the memory leak as soon as possible. We released the final fix within 4 hours of the problem occurring and fully resolved it at 12 pm UTC. In the meantime, we reached out to the customer who was overloading the API to notify them about the incorrect integration.

As a lasting improvement, we reconfigured the autoscaling mechanism on the AS1 cluster. After a series of tests, we identified a set of parameters that should let us react faster in similar cases and keep the API gateway alive for a reasonable time window after a potential (memory leak) issue is recognized.

**Summary**

We understand how critical our infrastructure is for our customers' businesses, so we will continue to move toward fully automated systems for dealing with this type of incident. Our goal is to minimize disruptions and outages for our customers regardless of the origin of the issue.
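The actual AS1 autoscaling parameters are not public; as a rough sketch of the tuning involved, the classic horizontal-autoscaler rule sizes the replica count proportionally to the ratio of observed to target utilization. Lowering the target (the illustrative values below are assumptions) makes scale-up kick in earlier, which buys headroom while a leak is being investigated.

```python
import math

def desired_replicas(current_replicas: int,
                     current_mem_pct: float,
                     target_mem_pct: float = 60.0,
                     max_replicas: int = 20) -> int:
    """Illustrative HPA-style rule: replicas scale with observed/target utilization.

    A lower `target_mem_pct` (e.g. 60 instead of 80) triggers scale-up sooner;
    `max_replicas` bounds cost. Real production values are not public.
    """
    ratio = current_mem_pct / target_mem_pct
    return min(max_replicas, max(1, math.ceil(current_replicas * ratio)))
```

For example, with 3 replicas at 90% memory against a 60% target, the rule asks for 5 replicas; against an 80% target it would still only ask for 4, reacting later, which is the trade-off the reconfiguration above addresses.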