InEvent experienced a major incident on September 13, 2021 affecting Firebase Web Sockets, lasting 7h 29m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Sep 13, 2021, 02:48 PM UTC
We are seeing slow scaling from the InEvent Firebase socket group and are investigating the issue with the Google Firebase team.
- identified Sep 13, 2021, 02:55 PM UTC
We have traced the slowness to the Firebase product. We are deploying an alternative based on Native Websockets, available under Company > Tools. We are also deploying UI improvements to the Firebase UI console for when Firebase is not responsive. Chat may be offline while the fix is being applied, but video and streaming should work normally.
- identified Sep 13, 2021, 03:45 PM UTC
We have fixed an autoscaling Redis instance that was unable to redirect load to a separate system. InEvent uses Redis to balance its write operations quickly instead of relying only on a traditional SQL database. The following components are still affected: Firebase Web Sockets, which covers the live chat on the Virtual Lobby.
- monitoring Sep 13, 2021, 05:07 PM UTC
The team has finished implementing all temporary solutions on the platform. These include a timeout option for slow-connecting sockets and disabling Redis as a single connection. The web socket chat will continue with limited chat support on the Firebase instance until we add support for per-instance local connections, which should happen next week. The Redis cache team will implement a new permanent solution by the end of this week, Friday at the latest.
- resolved Sep 13, 2021, 09:15 PM UTC
This incident has been resolved.
- postmortem Sep 13, 2021, 09:16 PM UTC
## What is a “Websocket”?

Without getting into too much detail, a **Websocket** is a method of network communication used for real-time applications. We use **Websocket** for real-time communication and interaction, and these are the modules that use the **Websocket** service:

* News Feed;
* Inbox;
* Session Chat;
* Session Q&A;
* Networking;
* Creation of Group Rooms;
* Invitations;
* Push Notifications;
* Live updates (session settings changes);
* Networking Roulette.

## Issues with Native Websocket and Regular Websocket (Google Firebase)

We have two Websocket providers: Google Firebase (realtime database) and our own implementation (Native Websocket). Today a large number of users connected at the same time, and this caused the Websocket servers to halt. Google Firebase couldn’t scale fast enough, and the Native Websockets couldn’t handle the scale either. As a result, users saw the “Connecting” popup, which never disappeared.

## Issues with Caching Server (Redis)

We had a major outage with our caching server (Redis) that took the entire platform and backend offline. The Redis server clogged up and couldn’t handle the server scaling and load, which resulted in an overall failure of the platform. The landing page and login page were still operational.

## Fixes implemented

For **Native Websockets** we have implemented manual scaling for now, and we will work on the autoscaling mechanism to support a large load in the future. If the connection fails, you will still be able to use the Virtual Lobby normally, with limited interaction.

For **Google Firebase** we couldn’t implement a fix. We will try sharding the entire operation into multiple microservices for different modules (Chat, Q&A, etc.), but since they don’t support replicas, it will be hard to scale on large events. If your event has more than 5,000 users, it’s better to use **Native Websockets**. If the connection fails, you will still be able to use the Virtual Lobby normally, with limited interaction.

For the **Caching Server (Redis)** we are still implementing a fix, but we have deployed a temporary workload that should replace the caching server for now and keep the platform and the backend stable. This is an internal fix and shouldn’t affect the user experience.

## What to expect for this week

The platform backend and all its modules should be operational. If **Native Websockets** or **Google Firebase** fails, you will still be able to access the platform and the Virtual Lobby, but users will have a limited experience without realtime interactions: chat, Q&A and the other modules listed above will not be operational.

We are constantly working on improvements, and we will announce when both realtime **Websockets** providers are fully functional for large events. Meanwhile, we can guarantee that the backend and the Virtual Lobby will be online, even with a limited realtime experience.