Friday, March 9, 2018

Edge Topology Replication Failures Caused by Mismatched Windows Updates

While getting a new Skype for Business edge server ready for production, I generally make sure all the latest Windows Updates are applied. I was doing exactly this for the company I work for (Nectar Services Corp).  Everything seemed to go just fine, but I noticed the edge server's replication status was showing as False when I ran Get-CsManagementStoreReplicationStatus.  The usual procedure of running Invoke-CsManagementStoreReplication -ReplicaFQDN servername did nothing. Even wiping out the C:\RtcReplicaRoot\xds-replica folder as described in this dusty old blog post didn't make a difference.

The Event Logs on the front-end server that was the master replication partner showed these fairly frequent error events:
Log Name:      Lync Server
Source:        LS File Transfer Agent Service
Date:          2/22/2018 10:01:29 AM
Event ID:      1046
Task Category: (1121)
Level:         Error
Keywords:      Classic
User:          N/A
Skype for Business Server 2015, File Transfer Agent cannot send replication data to Replica Replicator Agent on Edge
Edge machine:
Exception: System.ServiceModel.EndpointNotFoundException: There was no endpoint listening at that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details. ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it
I could reach the edge server's replication web service URL via port 4443 as described in the event log, so there wasn't a firewall issue or an issue with the web service that I could determine.
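If you want to repeat that reachability check from the front-end without opening a browser, a minimal sketch in Python might look like the following. The host name in the usage comment is a made-up placeholder, and `can_connect` is just an illustrative helper name, not part of any Skype for Business tooling.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Hypothetical usage (placeholder host name):
# can_connect("edge01.example.com", 4443)
```

Keep in mind that a successful TCP connect only proves the port is open. It says nothing about whether the replication web service behind it is actually accepting data, which is exactly the trap I fell into here.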

The edge server was also throwing errors, saying that it hadn't heard from any of the replication servers in a while and its feelings were very hurt.
Log Name:      Lync Server
Source:        LS Replica Replicator Agent Service
Date:          2/22/2018 12:04:32 PM
Event ID:      3045
Task Category: (3003)
Level:         Error
Keywords:      Classic
User:          N/A
The replication synthetic transaction has not been updated in a significant time period.
Time since the last update: 01.03:43:09
Cause: The Master Replicator Agent has not updated the replication transaction document in a significant time period.
If other replicas are experiencing similar issues, check the Master Replicator Agent and File Transfer Agent service health.  Verify access to the DFS files shares and replica file shares
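For a sense of scale, the "Time since the last update" value can be converted into something more readable. This is a small sketch that assumes the value follows a days.hours:minutes:seconds layout (my reading of the event text, not something the event itself documents); `parse_elapsed` is a hypothetical helper name.

```python
import re
from datetime import timedelta

# Assumed layout: optional "days." prefix, then hh:mm:ss.
_ELAPSED = re.compile(r"^(?:(\d+)\.)?(\d{1,2}):(\d{2}):(\d{2})$")


def parse_elapsed(text: str) -> timedelta:
    """Convert a 'd.hh:mm:ss' or 'hh:mm:ss' string into a timedelta."""
    match = _ELAPSED.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized elapsed-time value: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


elapsed = parse_elapsed("01.03:43:09")
hours_without_update = elapsed.total_seconds() / 3600  # roughly 27.7 hours
```

Under that assumption, the edge server had gone a bit more than a full day without hearing from the master.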
The weird thing was that replication WAS working before I installed the final batch of Windows Updates. So, it seemed to make sense to uninstall the most recent Windows Updates one at a time until things started working again. Unfortunately, after uninstalling those updates, along with a bunch of other ones in an increasingly desperate attempt to find the culprit, the issue persisted. Back to the drawing board.

I noticed that I was unable to copy binary files to the edge server via RDP copy/paste. Text files would copy fine, but any other filetype would cause the RDP session to crash with "Unexpected Server Error".  This seemed to be related to the replication issue, because this also worked fine prior to installing those updates. This is a good time to note that to access the edge server, I had to first RDP to a front-end server, then RDP to the edge from there, because the firewall was blocking all other access. This would prove to be relevant later.

I went as far as uninstalling and reinstalling Skype for Business along with all the supporting components, but it STILL didn't work.

Frustration level: STRATOSPHERIC

Finally, I nuked the entire edge server from orbit (It's the only way to be sure) and had the edge server rebuilt from scratch. Once again, replication worked.

End of story, right? Wrong. I had to figure out what the issue was, so I re-applied each of the original suspected updates until the issue re-appeared. And re-appear it did, along with the RDP copy/paste issue. The offending update was KB4072650, which is vaguely described as "Hyper-V integration components update for Windows virtual machines". The description wasn't much help, but the update certainly had an effect. Once again, removing that patch didn't make the issue disappear.

Finally, I checked to see if the front-end servers had this truly despicable update, AND THEY DIDN'T.  It was buried in Windows Update as an optional update, and optional updates weren't being automatically downloaded and installed. On a hunch, I installed the update on the front-ends (no restart required, thankfully), and VOILA, replication started working again, and I could copy/paste files between front-ends and edge via RDP.

So, this vague Windows Update was the source of all my issues. Sigh... Presumably, this would only show up in the following circumstances:

  • Using Hyper-V hosted virtual machines
  • Running Windows Server 2012 R2
  • Update 4072650 is installed on some VMs but not all
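The third condition is the one worth checking for. As a rough sketch, assuming you have already collected the installed update IDs from each server (for example by running Get-HotFix on each one and exporting the list), the comparison could look like this. The server names and the first update ID below are made-up placeholders, and `find_mismatched_updates` is a hypothetical helper.

```python
from typing import Dict, List, Set


def find_mismatched_updates(installed: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each update present on some servers to the servers that lack it."""
    # Every update seen on at least one server.
    all_updates = set().union(*installed.values())
    mismatches: Dict[str, List[str]] = {}
    for update in sorted(all_updates):
        missing = sorted(name for name, ids in installed.items() if update not in ids)
        if missing:
            mismatches[update] = missing
    return mismatches


# Placeholder inventory: server names and KB0000001 are invented for illustration.
inventory = {
    "fe01": {"KB0000001"},
    "fe02": {"KB0000001"},
    "edge01": {"KB0000001", "KB4072650"},
}
report = find_mismatched_updates(inventory)
# report == {"KB4072650": ["fe01", "fe02"]}
```

An empty result means every server has the same set of updates. Anything else is a candidate for the kind of mismatch described above.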

TL;DR version

If you're having replication errors on your edge server, and are getting LS File Transfer Agent Service error 1046 on your master replication server, then make sure that all front-end and edge servers have Update 4072650 if you see it applied on any one server.