Friday, July 11, 2014

Lync 2013 Edge server bug when adding new server pools

I had this happen to me a few weeks ago, and a co-worker had it happen to him today, so I figure it should be mentioned since nobody on the Internet seems to have a proper fix for it.

I installed a new Lync 2013 Standard Edition server in an existing Lync 2013 environment with an established edge server pool.  I created a test account on the new pool and began running through the usual testing scenarios.

All seemed well and fine until we tested federated communications.  An external party could establish an IM or A/V session with the internal user homed on the new pool and see their presence, but the internal user couldn't see presence nor establish a new IM or A/V session with anybody outside the company.

Puzzling to be sure, since everybody else in the company was OK.  I first checked to see if I could telnet to the typical ports required for edge functionality, and all was good. I then logged onto the edge to see if there were any relevant events.

Lo and behold, there was this interesting little warning:
LS Protocol Stack Event 14402
Multiple incoming connections on internal edge from non-internal servers.
In the past 289 minutes the server received 4 incoming connections on internal edge from non-internal servers. The last one was from host newlyncfe.contoso.com.
Cause: This can happen if an internal server is not present in the list of internal servers on the Access Edge Server.
Resolution:
If the server is a valid one, you need to add it to the list of internal servers on the Access Edge Server. If the server is invalid, you may be under an attack from that server.
This left me scratching my head.  How do you add a Lync 2013 server to the "list of internal servers"???  I recalled back to the olden times of OCS, where you had to manually add every valid server name to the edge, but you don't do that anymore, thanks to the wonder of the Topology Builder.

Internet searches brought up several cobweb-covered webpages from the late 2000's, all of them relating to OCS.  No good.  I re-ran Setup on the edge server, hoping that would trigger something and make things all good, but nope.

Finally, I decided to restart the Lync Server Access Edge service (during a period of low activity of course).  Once it restarted, all my troubles went away.  The user on the new front-end was able to initiate sessions to external users, and the 14402 errors on the edge went away.

My next step was going to be blindly restarting other services followed by a last-ditch server reboot. Since that turned out to be unnecessary, if this saves someone from doing a server restart (which would fix the problem too), that's awesome.

So, while Lync normally does an excellent job of detecting new servers and functioning fine without service restarts or server reboots, this is one situation where its not doing that.  It seems the Access Edge service caches the list of valid internal servers at service startup, and doesn't refresh that cache until the next service restart.

Because this seems to be one of the few instances where a service restart is required to maintain connectivity with a new downstream server, I would classify this behaviour as a...