Gridly incident

Degraded API performance and connection timeout

Gridly experienced a critical incident on September 7, 2021, lasting 3 hours and 8 minutes. The incident has been resolved; the full update timeline is below.

Update timeline

resolved Sep 07, 2021, 10:23 PM UTC

We are currently investigating a high rate of network connectivity failures to api.gridly.com. We have identified the cause of the issue and are working towards a resolution.
postmortem Sep 07, 2021, 10:26 PM UTC

### Impact

* Critical incident
* Outage on [api.gridly.com](http://api.gridly.com) for 3 hours and 8 minutes

### **Timeline on 2021-09-07 UTC**

* 07:07 PM - High rate of network connectivity failures to [api.gridly.com](http://api.gridly.com)
* 10:10 PM - Restarted the proxy layer and deployed a hotfix
* 10:15 PM - API back to normal

### **Root cause analysis \(RCA\)**

* The trigger was a brief interruption in our internal load balancer between microservices. It may have been caused by an IP change or a broken network peering, and it lasted only a very short time \(an estimated ~1 minute or less\).
* Our edge proxies were not yet properly configured for this case, so when the internal load balancer was briefly interrupted, the nginx proxy stopped working.
* The edge proxies went down and lost their connections, causing an outage of the entire API endpoint.
* We had the same kind of issue on [Aug 28](https://status.gridly.com/incidents/z1ltyd2vfjlr) and deployed a hotfix for it, but that hotfix was either missing or did not cover all interruption cases.
* We have also deployed a new strategy for handling more interruption cases.
* We will continue monitoring for this kind of issue over the next few days.
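
For API clients, a short upstream interruption like the one described above shows up as connection failures and timeouts. The following is a minimal client-side sketch of retrying with exponential backoff. It is not Gridly's hotfix, which the postmortem does not describe. The function name `call_with_backoff`, its parameters, and the choice of exceptions are assumptions made for illustration only.

```python
import time


def call_with_backoff(request, max_attempts=7, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call request() and retry on connection failures or timeouts.

    Hypothetical helper: `request` is any zero-argument callable that
    performs one API call and raises ConnectionError or TimeoutError
    when the connection fails.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

With these defaults the waits are 1, 2, 4, 8, 16 and 30 seconds, so a client keeps retrying for about a minute in total before giving up. That would be enough to ride out an interruption of the length estimated in the RCA, but not a multi-hour outage.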