Callbridge incident
Callbridge Conferencing - Delayed Service Response
Callbridge experienced a major incident on November 22, 2024 affecting Join Calls by Internet and Online Meeting Room and 1 more component, lasting 2h 44m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- identified Nov 22, 2024, 07:19 PM UTC
We are currently experiencing intermittent delayed service response with the conferencing services. Our team is diligently working to resolve this issue. We will follow up with an email once the service has been restored.
- monitoring Nov 22, 2024, 07:36 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring Nov 22, 2024, 07:49 PM UTC
We are continuing to monitor for any further issues.
- identified Nov 22, 2024, 09:17 PM UTC
We are continuing to investigate the delayed service response. We will update the Status page as soon as a resolution is in place.
- resolved Nov 22, 2024, 10:04 PM UTC
We identified an issue caused by the Google Calendar add-on, which impacted the performance of our video conferencing platform. To resolve this issue, we have temporarily disabled some Google Calendar integration features. The core video conferencing platform is now fully functional, and we are working to restore the add-on features as quickly as possible. We sincerely apologize for any inconvenience caused and appreciate your understanding as we address this matter.
- postmortem Nov 26, 2024, 08:31 PM UTC
## Summary of Root Cause

Google Workspace made internal changes around the beginning of November 2024 that changed the behavior of the Google Calendar add-on. We observed that the add-on resubmitted the same meeting changes to our platform during event synchronization, which caused a large increase in the number of API requests. Google's platform also enforces a “request quota”, which was triggered and resulted in requests to our platform being blocked. This in turn queued up more requests for the next event batch to synchronize the changes. After a couple of weeks of intermittent interruptions to the Google Calendar add-on, the concurrent requests to change existing meeting schedules began causing database deadlocks and transaction rollbacks. During the incident window the database was bogged down performing transaction rollbacks, which caused meeting scheduling and starting or joining meetings to time out for some or even most users.

## Action Plan to Prevent Future Service Incidents

* We disabled all Google Calendar add-ons on 11/25.
* We will rewrite how we process Google Calendar updates so that all changes to a given account are serialized (instead of updating meetings in parallel).
* We will collect logs and open a ticket with the Google Workspace team to understand their changes.

## Timeline (UTC)

* 11/5 - 11/15: An internal system alert showed an increased number of requests from the Google Calendar add-on (partner 1) and elevated system stress. The affected meeting linked to most of the requests was flagged to be bypassed in future request processing. We received several reports of issues using the Google Calendar add-on (multiple partners). We rolled out several changes to reduce the impact of the large number of requests on system resources. We also concluded that Google Workspace was blocking or throttling requests to us.
* 11/13 17:40 - 18:35: The database experienced a deadlock, causing several components to time out. We scaled up the system to increase capacity; meanwhile, Google Workspace paused its requests to us. The combination resolved the service incident.
* 11/22 19:10 - 22:00: The database experienced a deadlock, causing several components to time out. We flushed all the system nodes to clear the deadlock, but this did not resolve the issue. We then identified and blocked the 2 Google Calendar add-ons that were sending the most requests, which resolved the service incident.
* 11/25 16:30 - 16:40: The database experienced a deadlock, causing the platform to be very slow. We decided to disable all Google Calendar add-ons.

## Method of Discovery

Customers started reporting unusual service behavior with their meetings.

## Scope of Impact

The service was slow to respond when users tried to reach the meeting dashboard or start a meeting, and it returned errors when they scheduled a call. All in-meeting features, such as recording, chat, and muting, were unresponsive or very slow to respond. Some users were disconnected from their meetings.

## Resolution

The Google Calendar add-on was disabled, stopping all API requests to synchronize events.
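
## Illustration: Serializing Updates per Account

The Action Plan item about serializing all changes to a given account could look roughly like the sketch below. It is not Callbridge's implementation. It is a minimal Python example that assumes updates arrive as coroutines in a single process, and every name in it (`AccountSerializer`, `submit`, `apply_update`, the account IDs) is hypothetical. The idea is that updates for the same account run one at a time, so they cannot contend for the same meeting rows, while updates for different accounts can still overlap.

```python
import asyncio
from collections import defaultdict


class AccountSerializer:
    """Applies updates for one account one at a time (hypothetical helper).

    Updates for different accounts may still run concurrently.
    """

    def __init__(self):
        # One lock per account ID, created lazily on first use.
        self._locks = defaultdict(asyncio.Lock)

    async def submit(self, account_id, apply_update, update):
        # Hold the account's lock for the whole update, so two updates to
        # the same account are never in flight at the same time.
        async with self._locks[account_id]:
            return await apply_update(account_id, update)


async def demo():
    serializer = AccountSerializer()
    applied = []

    async def apply_update(account_id, update):
        # Stand-in for the real database transaction (assumption).
        await asyncio.sleep(0.01)
        applied.append((account_id, update))

    await asyncio.gather(
        serializer.submit("acct-1", apply_update, "move meeting to 10:00"),
        serializer.submit("acct-1", apply_update, "move meeting to 11:00"),
        serializer.submit("acct-2", apply_update, "rename meeting"),
    )
    return applied


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

An in-process lock only helps when a single worker handles a given account. A deployment with several workers would need to route each account to one worker or use a shared lock or queue, which this sketch does not cover.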