Gridly experienced a major incident on October 4, 2021 affecting API Requests, lasting 2h 9m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Oct 04, 2021, 01:09 PM UTC
We are currently investigating this issue.
- investigating Oct 04, 2021, 01:35 PM UTC
We are enabling maintenance mode to work on internal services. This is unplanned maintenance and should take less than 30 minutes from now.
- investigating Oct 04, 2021, 02:16 PM UTC
We will keep maintenance mode enabled for the next 15 minutes. We will provide updates as necessary.
- monitoring Oct 04, 2021, 02:36 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Oct 04, 2021, 03:19 PM UTC
An infrastructure workaround has been implemented and the service is operating normally. We have identified the cause of the issue and are working towards a resolution. We will provide a post-mortem shortly.
- postmortem Oct 05, 2021, 03:47 AM UTC
### **Impact**

* Major incident
* Degraded performance on [api.gridly.com](http://api.gridly.com); Gridly was running very slowly.

### **Timeline**

##### 2021-10-04 UTC

* 01:09 PM - Degraded performance on [api.gridly.com](http://api.gridly.com)
* 01:35 PM - Enabled maintenance mode for 45 minutes to upgrade the internal database
* 02:36 PM - API back to normal.

##### 2021-10-05 UTC

* 01:50 AM - Partial outage on [api.gridly.com](http://api.gridly.com); some internal services were down for 2-3 minutes
* 02:40 AM - Deployed hotfix to production. API back to normal.

### **Root cause analysis (RCA)**

* Since [Sep 29, 2021](https://status.gridly.com/incidents/376gfqpsg519), we had experienced unexpected downtime in an internal service, related to the license service (plan, seat & subscription). At that time, we scaled out to increase high availability.
* Our database was under pressure from high traffic. It kept working, but database operations and response times were very slow, which is why we experienced degraded performance on some API endpoints.
* We scaled up and upgraded the hardware specification on the database side to help reduce workload and impact.
* From performance insights and error tracking, we identified the root cause: blockers during task processing.
* After identifying the root cause, we deployed a hotfix that optimizes some of the async logic.
* Everything is back to normal. We will continue monitoring for this kind of issue over the next few days.