Level365 incident
Intermittent Registration Issues with Yealink Phones in Some Instances
Level365 experienced a notice incident on August 6, 2020 affecting Core UCaaS Services, lasting 3h 4m. The incident has been resolved; the full update timeline is below.
Update timeline
- identified Aug 06, 2020, 01:26 PM UTC
We have identified a situation where in some cases Yealink phones will fail to register properly when behind an EdgeMarc session border controller. We are investigating.
- identified Aug 06, 2020, 01:27 PM UTC
We are currently rolling out a fix for this issue.
- monitoring Aug 06, 2020, 02:02 PM UTC
Fix has been implemented and we are seeing all registrations completing normally. We will continue to monitor.
- resolved Aug 06, 2020, 02:35 PM UTC
This issue is now considered resolved. If you are still experiencing issues, please open a ticket at https://support.level365.com
- postmortem Aug 07, 2020, 06:03 PM UTC
## Overview

Wednesday night, an upgrade was rolled out across our data centers. The upgrade had been tested thoroughly before rollout, but it contained an issue that is triggered only when a specific set of conditions is met, and our testing did not cover that case. While engineering worked on the root cause, the technical support department began rolling out temporary remediation steps to all impacted customers. We had restored service for approximately 75% of impacted customers when engineering discovered the cause of the issue and implemented a global fix.

## Observations

After studying this issue from the pre-upgrade testing phase through the life of the incident, we have identified several areas where we can improve:

### Testing

We will improve our testing methods, including adding unit tests that cover every edge and corner case we can identify. This will help us discover issues before an upgrade reaches customers. These tests will also be run before and after every future upgrade, allowing us to compare pre- and post-upgrade results.

### Communication

We did not do the best job of communicating the incident and its status throughout the incident lifecycle. In future incidents, we will assign a Communications Lead whose sole responsibility will be to send regular updates to customers and keep the status page as current as possible.

### Breaking Changes

We have discussed this issue with engineering and emphasized how important it is to communicate any potentially breaking code changes included in an upgrade. Just because something _shouldn’t_ break something doesn’t mean that it won’t, and if it’s not documented somewhere accessible, the remediation process slows down.

## Details

This particular issue only impacted customers who use both an EdgeMarc Session Border Controller and Yealink phones.
The SIP registration servers were rejecting registrations when both conditions were met, reporting “Duplicate Headers”. _Why_ was a little more difficult to determine. Two questions needed to be answered: which headers were duplicated, and what changed during the upgrade?

A tcpdump quickly showed that the duplicate header was an `Allow-Events` header inserted by the EdgeMarc. After further research, it appeared that a default setting on the EdgeMarcs helps with local call survivability by requesting a certain class of events: `Allow-Events: BroadWorksSubscriberData`. This only seemed to be an issue with Yealink phones. Polycom phones seem to have the `BroadWorksSubscriberData` events class appended to their headers properly.

Once we discovered what was being duplicated and how to disable that default option, we began to roll out mitigation, changing this setting for all affected customers.

At the same time, we were working with engineering to answer the second question: what changed? We discovered that engineering had changed SIP packet processing in an attempt to tighten security and filter out non-RFC data. This, combined with the EdgeMarcs duplicating one of the SIP headers, turned previously harmless behavior into a registration failure. We finally discovered that this functionality was new and enabled by default. We then disabled it using the information provided by engineering, and the problem was resolved.
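
To illustrate the failure mode described above, here is a minimal Python sketch of a check that counts how many separate lines carry each header name in a SIP request. It is not the registration servers' actual logic, which is not documented here. The two sample REGISTER messages, the host names, and the event names other than `BroadWorksSubscriberData` are hypothetical and exist only to show the difference between a second `Allow-Events` line being inserted and the events class being appended to an existing line.

```python
from collections import Counter


def header_counts(sip_message):
    """Count how many separate lines carry each header name (lowercased)."""
    # Headers end at the first blank line; anything after that is the body.
    head = sip_message.replace("\r\n", "\n").split("\n\n", 1)[0]
    counts = Counter()
    for line in head.split("\n")[1:]:  # skip the request line
        # Ignore folded continuation lines and lines without a colon.
        if not line or line[0] in " \t" or ":" not in line:
            continue
        counts[line.split(":", 1)[0].strip().lower()] += 1
    return counts


def has_duplicate(sip_message, header_name):
    """True if header_name appears on more than one line of the message."""
    return header_counts(sip_message)[header_name.lower()] > 1


def build_register(allow_events_lines):
    """Build a hypothetical REGISTER request with the given Allow-Events lines."""
    lines = [
        "REGISTER sip:pbx.example.invalid SIP/2.0",
        "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK-hypothetical",
        "From: <sip:1001@pbx.example.invalid>;tag=abc",
        "To: <sip:1001@pbx.example.invalid>",
        "Call-ID: hypothetical-call-id",
        "CSeq: 1 REGISTER",
        "Contact: <sip:1001@192.0.2.10:5060>",
    ]
    lines += ["Allow-Events: " + value for value in allow_events_lines]
    lines.append("Content-Length: 0")
    return "\r\n".join(lines) + "\r\n\r\n"


# Hypothetical: the phone sends its own Allow-Events line and the SBC
# inserts a second one, as observed with Yealink phones in this incident.
YEALINK_BEHIND_EDGEMARC = build_register(
    ["talk,hold,refer", "BroadWorksSubscriberData"]
)

# Hypothetical: the events class is appended to the existing line,
# as appeared to happen with Polycom phones.
POLYCOM_BEHIND_EDGEMARC = build_register(
    ["talk,hold,refer,BroadWorksSubscriberData"]
)

if __name__ == "__main__":
    print(has_duplicate(YEALINK_BEHIND_EDGEMARC, "Allow-Events"))
    print(has_duplicate(POLYCOM_BEHIND_EDGEMARC, "Allow-Events"))
```

Sample messages like these could also serve as fixtures for the pre- and post-upgrade test comparison described under Testing, with each phone and SBC combination represented by one captured or synthetic registration.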