2600Hz incident

Increased Reports Of Call Completion Issues in EWR

2600Hz experienced a minor incident on February 16, 2023 affecting Telephony Services, lasting 1h 1m. The incident has been resolved; the full update timeline is below.

Started: Feb 16, 2023, 09:12 PM UTC
Resolved: Feb 16, 2023, 10:14 PM UTC
Duration: 1h 1m
Detected by Pingoru: Feb 16, 2023, 09:12 PM UTC

Affected components

Telephony Services

Update timeline

investigating Feb 16, 2023, 09:33 PM UTC

We are receiving reports of call completion issues in EWR.

monitoring Feb 16, 2023, 09:35 PM UTC

We have paused the EWR zone in full and are redirecting traffic arriving in EWR to ORD in Kamailio. Call completion rates seem to be returning to normal levels.

monitoring Feb 16, 2023, 09:46 PM UTC

We have had no further reports of issues and all alerts have cleared. Please write in to support if you have any example calls showing issues after 13:35 PT.

resolved Feb 16, 2023, 10:14 PM UTC

Multiple clients are reporting no further issues.

postmortem Feb 21, 2023, 10:35 PM UTC

On Thursday the 16th of February we experienced a Zswitch outage between 19:25 and 19:45 GMT.

As soon as we noticed alerts appearing for disconnects, a 911 bridge was started with Support, Operations, Engineering and the CTO.

The problem appeared to be localised to EWR; we therefore paused all EWR servers as a precaution, which remediated the issues with calls.

From here we investigated what could have caused the outage. We found that compaction had been started on the bc003.ewr server only a few minutes beforehand, and we could also see that the load on this server was unreasonably high. The compaction was stopped, which brought the load down to a stable level. We also noticed that this BigCouch server was delivering DB responses in the 5-6 second range, rather than the sub-50ms responses we would expect. This was due to the server not behaving in the expected manner while compaction was running.

Further tests helped us determine that this was an issue with the server itself, and we put a plan in place to migrate it to new hardware.

After the migration was complete, we stress tested the new server further to confirm that compaction did not cause any issues. It’s clear that this server was a major component in the 4 most recent outages.

We have added new alerting that gives us higher visibility of response times from all servers to BigCouch, which will allow us to notice any similar issues before they start to have an impact. We are also further improving this alerting to exclude Crossbar (API) calls, which will help keep out false positives, and adding similar active testing across all HA layers in Kazoo.

During the holiday weekend we held an “all hands on deck” meeting to go over the outages we’ve seen and the steps we can put in place to improve. We want to convey that the seriousness of the downtime we have seen is understood and we’re continuing to work to prevent any recurrences. Following this call we have come out with a list of action items that we’ll be undertaking with the highest priority, including the aforementioned monitoring changes and reprioritising engineering load to speed up the migration from BigCouch to CouchDB 3.

If anyone has any further questions, or would like deeper information on the steps we’re taking to prevent further outages, please don’t hesitate to get in touch.
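
For illustration only, the kind of response-time check described in the postmortem might look roughly like the sketch below. It is not 2600Hz's actual alerting. The `Sample` type, the `source` label, the `slow_nodes` function and the node name bc001.ewr are all hypothetical, and the 50 ms default is simply taken from the "below 50ms" expectation stated above. The sketch groups database response-time samples by node, skips samples labelled as Crossbar (API) traffic, and flags any node whose median response time exceeds the threshold.

```python
from dataclasses import dataclass
from statistics import median

# Assumption: 50 ms mirrors the "below 50ms" expectation quoted in the postmortem.
EXPECTED_MAX_MS = 50.0


@dataclass
class Sample:
    """One observed database response time (hypothetical record shape)."""
    db_node: str       # database node that answered, e.g. "bc003.ewr"
    source: str        # hypothetical label for the caller, e.g. "callflow" or "crossbar"
    latency_ms: float  # measured response time in milliseconds


def slow_nodes(samples, threshold_ms=EXPECTED_MAX_MS, excluded_sources=("crossbar",)):
    """Return {node: median latency} for nodes whose median exceeds threshold_ms.

    Samples whose source is in excluded_sources are ignored, so that slow API
    requests do not trigger false positives.
    """
    by_node = {}
    for sample in samples:
        if sample.source in excluded_sources:
            continue
        by_node.setdefault(sample.db_node, []).append(sample.latency_ms)

    flagged = {}
    for node, latencies in by_node.items():
        mid = median(latencies)
        if mid > threshold_ms:
            flagged[node] = mid
    return flagged


if __name__ == "__main__":
    observed = [
        Sample("bc003.ewr", "callflow", 5200.0),
        Sample("bc003.ewr", "callflow", 5900.0),
        Sample("bc001.ewr", "callflow", 20.0),    # hypothetical healthy node
        Sample("bc001.ewr", "crossbar", 4000.0),  # excluded API traffic
    ]
    print(slow_nodes(observed))
```

Using the median rather than a single sample is a design choice in this sketch, so that one slow request does not raise an alert; a production check would more likely use percentiles over a time window.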