Monday, March 12, 2018

The Evolution of the Skype Optimizer: From Locally-Run VBScript to Azure Web App

I was recently amazed when I realized that the seeds of what is now the Skype Optimizer were planted 10 YEARS AGO, back when Office Communications Server 2007 R2 was starting to make headway in the unified communications space.

The Beginning

The program grew out of a need to know when to strip the +1 from a North American local number and when not to. Rather than rehash the creation myth, you can read all about it in one of my earliest blog posts, where I announced the Dialing Rule Optimizer to the world (at that time, the 30-odd subscribers to my blog).
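For readers who haven't met this problem before, here is a minimal sketch of the idea in Python. It is not the Optimizer's actual code. The prefixes are made up, and the names LOCAL_PREFIXES, build_local_rule and to_gateway_format are purely illustrative. It assumes that "local" can be described as a list of area code + exchange prefixes, and that a local call should go to the gateway as 10 digits without the +1.

```python
import re

# Hypothetical list of local prefixes (area code + exchange).
# Real data would come from a numbering-plan source; these are invented.
LOCAL_PREFIXES = ["416555", "647555"]


def build_local_rule(prefixes):
    """Build a regex matching +1 numbers whose first six digits are local.

    The capture group holds the 10-digit number without the +1.
    """
    alternation = "|".join(sorted(re.escape(p) for p in prefixes))
    return re.compile(r"^\+1((?:" + alternation + r")\d{4})$")


def to_gateway_format(number, rule):
    """Strip the +1 from local numbers; leave everything else untouched."""
    match = rule.match(number)
    return match.group(1) if match else number


rule = build_local_rule(LOCAL_PREFIXES)
print(to_gateway_format("+14165551234", rule))  # local: +1 is stripped
print(to_gateway_format("+12125551234", rule))  # not local: unchanged
```

The hard part in real life is not the regex itself but knowing which prefixes belong in the local list for a given location.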

The very first version was a straight VBScript: I had to manually enter the proper variables and run it by hand. Rather than give the code away, I told people to email me the phone numbers they wanted optimized dial rules for. I would run the script and send them the results, which came as a simple text file containing either a bunch of regex or text formatted to be applied to AudioCodes or Dialogic gateways. Word got around within Microsoft, and I found myself busy sending stuff to various Microsoft consultants.

When Lync 2010 came around, which was in its earliest days known as Communications Server "14", I added the capability to create simple routing rules that consisted of a few lines of PowerShell code. I also wrapped the code in a simple UI using something called an HTA (short for HTML Application). It made generating rulesets easier for me, but it was still something that I was running from my local machine.

The earliest known copy of the original Dialing Rule Optimizer. I obtained this from the Smithsonian Museum. The text-only v1.0 has been lost to the sands of time.

I soon figured out that it would be relatively straightforward to move the HTA into an actual web page. I put the code on a web server hosted by the company I was working for at the time and opened up the tool to the entire world. I actually put this code on the computer that was running our OCS 2007 R2 server!

The very first web-based iteration of the Dialing Rule Optimizer. Note the Communications Server "14" logo on the top-right.

Once Communications Server "14" became Lync 2010, I realized that I could go beyond simple optimized route creation, and modified the Optimizer to create everything required for a simple Enterprise Voice setup for US and Canada deployments.

Shortly after, I realized that I could do the same for other countries as well. The Optimizer interface grew somewhat to accommodate the requirements for different countries.

Dramatic differences abound! Communications Server "14" has changed to "Microsoft Lync". Also, UK dial plans!

I slowly added other countries to the Optimizer. I also added features such as extension dialing rules and least-cost/failover routing, among many others.

Over time, the back-end code base became difficult to support. I was using a series of XML files to deal with languages and country-specific dial rules, and the sheer number of them was becoming cumbersome to manage. I decided to move everything from the company-hosted platform to Amazon Web Services. I built a single Windows VM with SQL Express and ported the XML files to a database. It worked well, but AWS was starting to cost a fair bit to run for a free service. Donations were not keeping pace with costs.

I then discovered that Microsoft MVPs got a monthly allotment of funds in Azure. I immediately moved my infrastructure to Azure, where it ran mostly trouble-free for the next several years.

The Optimizer feature set grew and grew, but the interface was still as ugly as the day I first created it. One person even suggested it looked like a GeoCities page. Hey, my argument was always that I was not (and am still not) a web developer.

The Modern Era

I decided to try to give the Optimizer a more modern look. I completely rewrote the front-end code using Notepad++ as my trusty editor. I replaced the clunky extension builder with a better JavaScript framework that emulated an Excel spreadsheet, and made other significant under-the-hood improvements. After much trial and error, I was pleased to unveil the new look.

The website looked modern, clean and easy-to-use.  However, it bugged me that while the site was running in Azure, it was still just a single Windows VM with a local SQL Server instance. It was also costing most of my monthly Azure MVP credits with not a lot of headroom. I decided to try to make the Skype Optimizer simpler and cheaper to manage, figuring that there would probably come a day when I would not be a Microsoft MVP (the horror!!!) and I'd be expected to pay a monthly bill (Oh the humanity!!!).

My first step was to dump the local SQL and move to Azure SQL Database. First, I had to copy the gigabytes of data to the new Azure SQL instance. I opted to use transactional replication, which would allow me to keep both my local and Azure-based SQL instances up-to-date while I tested things out. It turned out to be ridiculously easy. I had to modernize my code a bit so that it could read and write data in Azure SQL, but this was pretty straightforward as well.

With that hurdle out of the way, I looked at a few different ways to further reduce my costs and administrative burden. 

Docker Containers

I'd heard about Docker containers and how they were an easy way to reduce overall complexity and costs compared with a traditional virtual machine. I installed Docker on my home machine and started messing around with it. I used the Image2Docker tool to make a copy of my Windows VM-based website and installed it locally. I had to make quite a few modifications to my Dockerfile to support some of the added features that Image2Docker didn't capture, but after a while, I managed to make the Skype Optimizer work in a Docker container.

Moving my newly-created container to Azure wasn't too much work, but there was definitely a learning curve involved. It started up fine, and I pointed my DNS entries to the container and away we went!  However, all was not perfect:
  1. My container stopped working a few times over the span of a few weeks. Troubleshooting this proved to be nearly impossible due to the nature of Docker containers and how you lose the previous state every time you restart it. Either that or I just don't know how these things really work.
  2. Making code modifications wasn't simple either. I'd have to make the change in my local Docker image and publish that image to Azure. The startup process took 5-10 minutes, which was probably due to how I built my container.  I looked at ways to improve the startup time, but it was already eating up lots of my time.
  3. The cost to run the container wasn't much lower than that of a full VM.
For those reasons, I decided that Docker containers weren't well-suited to my needs and I finally turned to....

Azure Web App

All the back-end code changes I made to the Optimizer to support both Azure SQL Database and Docker containers actually had an unexpected side benefit: it allowed me to easily move the Optimizer to Azure Web App, which is Azure's web hosting framework.

To make the process of managing this easier, I finally moved away from Notepad++ as a development environment and embraced Visual Studio. Visual Studio made it trivial to take my entire website and migrate it to Azure. I had a few challenges with making my code-signing certificate work, but in the end it all worked flawlessly. 

The Skype Optimizer has been running as an Azure Web App for several months now. It's extremely reliable, simple to manage, and about 3 times cheaper to run than the original Windows VM.

The Future

So, there you have it. The entire history of the Skype Optimizer posted here for posterity. Where do things go from here? Well, with Microsoft Teams eventually taking over the Enterprise Voice role from Skype for Business, I may decide to look into what it would take to turn the Skype Optimizer into a Teams Direct Routing and Calling Plans management platform. 

Until then, I will keep updating the Skype Optimizer to make sure Skype and Teams administrators worldwide have a single place to get accurate, up-to-date dialing rules for every country in the world.

Friday, March 9, 2018

Edge Topology Replication Failures Caused by Mismatched Windows Updates

While getting a new Skype for Business edge server ready for production, I generally make sure all the latest Windows Updates are applied. I was doing exactly this for the company I work for (Nectar Services Corp).  Everything seemed to go just fine, but I noticed the edge server's replication status was showing as False when I ran Get-CsManagementStoreReplicationStatus.  The usual procedure of running Invoke-CsManagementStoreReplication -ReplicaFQDN servername did nothing. Even wiping out the C:\RtcReplicaRoot\xds-replica folder as described in this dusty old blog post didn't make a difference.

The Event Logs on the front-end server that was the master replication partner showed these fairly frequent error events:
Log Name:      Lync Server
Source:        LS File Transfer Agent Service
Date:          2/22/2018 10:01:29 AM
Event ID:      1046
Task Category: (1121)
Level:         Error
Keywords:      Classic
User:          N/A
Skype for Business Server 2015, File Transfer Agent cannot send replication data to Replica Replicator Agent on Edge
Edge machine:
Exception: System.ServiceModel.EndpointNotFoundException: There was no endpoint listening at that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details. ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it
I could reach the edge server's replication web service URL via port 4443 as described in the event log, so there wasn't a firewall issue or an issue with the web service that I could determine.

The edge server was also throwing errors, saying that it hadn't heard from any of the replication servers in a while and its feelings were very hurt.
Log Name:      Lync Server
Source:        LS Replica Replicator Agent Service
Date:          2/22/2018 12:04:32 PM
Event ID:      3045
Task Category: (3003)
Level:         Error
Keywords:      Classic
User:          N/A
The replication synthetic transaction has not been updated in a significant time period.
Time since the last update: 01.03:43:09
Cause: The Master Replicator Agent has not updated the replication transaction document in a significant time period.
If other replicas are experiencing similar issues, check the Master Replicator Agent and File Transfer Agent service health.  Verify access to the DFS files shares and replica file shares
The weird thing was that replication WAS working prior to me installing the final batch of Windows Updates. So, it seemed to make sense to uninstall the most recent Windows Updates one at a time until things started working again. Unfortunately, after uninstalling those updates, along with a bunch of other ones in an increasingly desperate attempt to find the culprit, the issue still persisted. Back to the drawing board.

I noticed that I was unable to copy binary files to the edge server via RDP copy/paste. Text files would copy fine, but any other filetype would cause the RDP session to crash with "Unexpected Server Error".  This seemed to be related to the replication issue, because this also worked fine prior to installing those updates. This is a good time to note that to access the edge server, I had to first RDP to a front-end server, then RDP to the edge from there, because the firewall was blocking all other access. This would prove to be relevant later.

I went as far as uninstalling and reinstalling Skype for Business along with all the supporting components, but it STILL didn't work.

Frustration level: STRATOSPHERIC

Finally, I nuked the entire edge server from orbit (It's the only way to be sure) and had the edge server rebuilt from scratch. Once again, replication worked.

End of story, right? Wrong. I had to figure out what the issue was, so I re-applied each of the original suspected updates until the issue re-appeared. And re-appear it did, along with the RDP copy/paste issue. The offending update was 4072650, which is vaguely described as Hyper-V integration components update for Windows virtual machines. The description wasn't much help, but the update certainly had an effect. Once again, removing that patch didn't make the issue disappear.

Finally, I checked to see if the front-end servers had this truly despicable update, AND THEY DIDN'T. It was buried in Windows Update as an optional update, and optional updates weren't being automatically downloaded and installed. On a hunch, I installed the update on the front-end servers (no restart required, thankfully), and VOILA, replication started working again, and I could copy/paste files between front-ends and edge via RDP.

So, this vague Windows Update was the source of all my issues. Sigh... Presumably, this would only show up in the following circumstances:

  • Using Hyper-V hosted virtual machines
  • Running Windows Server 2012 R2
  • Update 4072650 is installed on some VMs but not all
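If you want to catch the "installed on some but not all" condition before it bites, the check can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The server names and the placeholder KB0000001 are invented, and find_mismatched_updates is a hypothetical helper. It assumes you have already collected the list of installed updates per server by some other means (for example, PowerShell's Get-HotFix cmdlet run on each machine).

```python
def find_mismatched_updates(inventory):
    """Return updates that are installed on some servers but not all.

    inventory maps a server name to the set of update IDs installed on it.
    The result maps each mismatched update ID to the sorted list of
    servers that are missing it.
    """
    all_updates = set().union(*inventory.values())
    mismatched = {}
    for update in sorted(all_updates):
        missing = sorted(
            server for server, installed in inventory.items()
            if update not in installed
        )
        if missing:
            mismatched[update] = missing
    return mismatched


# Hypothetical inventory: names and KB0000001 are placeholders.
inventory = {
    "fe01": {"KB0000001"},
    "fe02": {"KB0000001"},
    "edge01": {"KB0000001", "KB4072650"},
}

for update, servers in find_mismatched_updates(inventory).items():
    print(f"{update} is missing on: {', '.join(servers)}")
```

With the sample data above, the only update reported is the one present on the edge server but absent from both front-ends, which mirrors the situation described in this post.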

TL;DR version

If you're having replication errors on your edge server, and are getting LS File Transfer Agent Service error 1046 on your master replication server, then make sure that all front-end and edge servers have Update 4072650 if you see it applied on any one server.