Boostlingo incident

Connectivity issues in North America

Boostlingo experienced a major incident on May 31, 2023, affecting Boostlingo SMS, Boostlingo Voice IVR, and other components, lasting 4h 30m. The incident has been resolved; the full update timeline is below.

Started: May 31, 2023, 06:25 PM UTC
Resolved: May 31, 2023, 10:55 PM UTC
Duration: 4h 30m
Detected by Pingoru: May 31, 2023, 06:25 PM UTC

Affected components

Boostlingo SMS
Boostlingo Voice IVR
Boostlingo Interpreter Portal
Boostlingo Group Rooms
Boostlingo EMail API v3
Boostlingo Requestor Portal
Boostlingo Communication REST API
Boostlingo Network Traversal Service
Boostlingo Speech Recognition

Update timeline

investigating May 31, 2023, 06:25 PM UTC

We are investigating reports of connection issues and slow loading speeds across our US server.

identified May 31, 2023, 06:42 PM UTC

The issue has been identified and a fix is being implemented.

monitoring May 31, 2023, 07:17 PM UTC

A fix has been implemented and we are monitoring the results.

resolved May 31, 2023, 10:55 PM UTC

We have concluded our monitoring and confirmed that the correct action has been taken. We will update this incident with a post-mortem in the next 24-48 hours.

postmortem Jun 02, 2023, 09:34 PM UTC

The incident was caused by issues with the SignalR service, which was dropping connections. The dropped connections drove our server into critical hot code paths, which in turn caused thread contention and thread pool exhaustion.

Steps taken to resolve:

1. Previously, when a call was completed, we used synchronous API calls to update the call record and communicate that it was hung up. In some cases these API calls would wait for a response, which created performance issues during high-volume hours. Moving these API calls to async operations allows them to run in parallel and keeps the application stable.
2. We optimized code paths to improve CPU utilization.
3. We added capacity to our real-time communication service, MS Azure SignalR, essentially scaling up the service to handle a higher volume of real-time connections and messages.
4. We improved our database by adding or modifying four indexes, which help speed up data retrieval for call-specific tables.
5. We updated the maintenance tasks on our database. These tasks are crucial for database performance because they organize and refresh index structures and statistics.
6. We implemented deeper monitoring, tracing, and profiling tools to help us identify root causes more efficiently, provide better visibility into our products, and enable incident managers to respond before an issue impacts important services.
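As a rough illustration of step 1, the sketch below contrasts waiting for each call-record update in turn with issuing the updates as async operations that run in parallel. It is not Boostlingo's code: the function mark_call_hung_up, the simulated 50 ms latency, and the call IDs are all hypothetical, and Python's asyncio stands in for whatever async mechanism the actual service uses.

```python
import asyncio
import time


async def mark_call_hung_up(call_id: str, latency_s: float = 0.05) -> str:
    """Hypothetical stand-in for the API call that marks a call record as hung up."""
    await asyncio.sleep(latency_s)  # simulates waiting for the remote API response
    return call_id


async def update_sequentially(call_ids):
    """Wait for each update to finish before starting the next one."""
    return [await mark_call_hung_up(c) for c in call_ids]


async def update_concurrently(call_ids):
    """Start all updates at once and await them together."""
    return list(await asyncio.gather(*(mark_call_hung_up(c) for c in call_ids)))


def timed(coro_fn, call_ids):
    """Run one of the update strategies and return (results, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(call_ids))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    ids = [f"call-{i}" for i in range(10)]  # hypothetical call IDs
    _, seq = timed(update_sequentially, ids)
    _, conc = timed(update_concurrently, ids)
    print(f"sequential: {seq:.2f}s, concurrent: {conc:.2f}s")
```

With ten simulated updates, the sequential version takes roughly the sum of the individual latencies, while the concurrent version takes roughly the latency of a single update. That is the effect the postmortem describes for high-volume hours.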