Gridly incident

Degraded performance on API requests

Gridly experienced a major incident on October 4, 2021 affecting API Requests, lasting 2h 9m. The incident has been resolved; the full update timeline is below.

Started: Oct 04, 2021, 01:09 PM UTC
Resolved: Oct 04, 2021, 03:19 PM UTC
Duration: 2h 9m
Detected by Pingoru: Oct 04, 2021, 01:09 PM UTC

Affected components

API Requests

Update timeline

investigating Oct 04, 2021, 01:09 PM UTC

We are currently investigating this issue.
investigating Oct 04, 2021, 01:35 PM UTC

We are enabling maintenance mode to work on internal services. This is unplanned maintenance, expected to finish in less than 30 minutes from now.
investigating Oct 04, 2021, 02:16 PM UTC

We will keep maintenance mode enabled for the next 15 minutes. We will provide updates as necessary.
monitoring Oct 04, 2021, 02:36 PM UTC

A fix has been implemented and we are monitoring the results.
resolved Oct 04, 2021, 03:19 PM UTC

An infrastructure workaround has been implemented and the service is operating normally. We have identified the cause of the issue and are working towards a resolution. We will provide a post-mortem shortly.
postmortem Oct 05, 2021, 03:47 AM UTC

### **Impact**

* Major incident
* Degraded performance on [api.gridly.com](http://api.gridly.com); Gridly was running very slowly.

### **Timeline**

##### 2021-10-04 UTC

* 01:09 PM - Degraded performance on [api.gridly.com](http://api.gridly.com).
* 01:35 PM - Enabled maintenance mode for 45 minutes to upgrade the internal database.
* 02:36 PM - API back to normal.

##### 2021-10-05 UTC

* 01:50 AM - Partial outage on [api.gridly.com](http://api.gridly.com); some internal services were down for 2-3 minutes.
* 02:40 AM - Deployed hotfix to production. API back to normal.

### **Root cause analysis \(RCA\)**

* We have had unexpected downtime in an internal service since [Sep 29, 2021](https://status.gridly.com/incidents/376gfqpsg519), related to the license service \(plan, seat & subscription\). At that time, we scaled out to increase high availability.
* Our database was under pressure because of high traffic. It kept working, but database operations and response times were very slow, which is why we experienced degraded performance on some API endpoints.
* We scaled up and upgraded the hardware specification of the database to help reduce the workload and impact.
* From perf insight and error tracking, we identified the root cause: blockers during task processing.
* After identifying the root cause, we deployed a hotfix that optimizes some of the async logic.
* Everything is back to normal. We will continue monitoring for this kind of issue over the next few days.