Thursday, January 16, 2014

High Processor Utilization on Lync 2013 Front-End Servers

We have a customer who is about to migrate from Lync 2010 to Lync 2013.  They've got a few lightly loaded Lync 2013 Enterprise Edition pools with three servers each, all running Windows Server 2008 R2 Standard Edition on VMware, with all patches up to date.

For no apparent reason, some of the servers will suddenly see their processor utilization spike to near 100% for extended periods, when their typical utilization is less than 5%. A look at Task Manager shows two instances of the IIS worker process (w3wp.exe) consuming large amounts of processor resources.  There are no events in the Event Logs to indicate an issue.

Running an IISReset on the affected node brings processor utilization back to normal, but this is obviously not a real solution.  We opened a ticket with Microsoft PSS, and they confirmed that others are seeing the same thing.  The source of the problem appears to be garbage collection in the LyncIntFeature and LyncExtFeature application pools in IIS.  Recycling those pools returns processor utilization to normal (for a while, at least).

Microsoft is actively working to resolve the issue, and I will post a permanent solution for all to see as soon as one becomes available.

UPDATE:  Thanks to @dannydpa on Twitter, it appears the trigger may be Lync topology publishing. I confirmed this by updating the topology and publishing it.  Less than 10 minutes later, processor utilization spiked on all the servers.  Recycling the aforementioned app pools resolved the issue.

To help others with this issue, I've created a little PowerShell script that recycles the LyncIntFeature and LyncExtFeature app pools on every server in the pool hosting the Central Management Store.

$CMPool = (Get-CsService -CentralManagement | Where-Object {$_.Active}).PoolFqdn
$CMMembers = (Get-CsPool $CMPool).Computers
ForEach ($Computer in $CMMembers)
{
    $Session = New-PSSession -ComputerName $Computer
    Invoke-Command -Session $Session -ScriptBlock {
        # WebAdministration is not auto-loaded under PowerShell 2.0
        # (the default on Windows Server 2008 R2), so import it explicitly
        Import-Module WebAdministration
        Restart-WebAppPool LyncExtFeature
        Restart-WebAppPool LyncIntFeature
    }
    Remove-PSSession $Session
}
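If you want to confirm that LyncIntFeature and LyncExtFeature really are the pools behind the busy w3wp.exe instances before recycling anything, something like this should work (a sketch, assuming IIS 7+ and that you run it locally on the affected front end):

```powershell
# Map each running w3wp.exe PID to its application pool name
# (appcmd.exe ships with IIS in System32\inetsrv)
& "$env:windir\System32\inetsrv\appcmd.exe" list wp

# Then see which w3wp.exe PIDs are actually burning CPU
Get-Process w3wp | Sort-Object CPU -Descending | Select-Object Id, CPU
```

Match the high-CPU PIDs from the second command against the pool names from the first; if they line up with the two Lync feature pools, the recycle script above should clear it.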

8 comments:

  1. The CPU spike only happens when adding or removing an object from the Topology. Changing a value does not cause this. It also appears to be somehow related to Response Groups.

    If these are VMs, disable NUMA support.

  2. We have NUMA spanning disabled and also made sure there is no CPU overcommit. We have a little over 100 RGS workflows in our pool. We created all of them via PowerShell on our new Lync 2013 pool; we never migrated them from Lync 2010 to Lync 2013. Microsoft is investigating new traces. Hopefully they find something.

    1. Seems as though MS is homing in on response groups, but we had just one for testing. Probably not the cause in our case.

      Ken

  3. From what I am seeing, it only affects the pool that hosts the CMS. We have no Response Groups, but it appears an addition to the topology triggered it.

    1. We saw this happen on the same pool in December, and it wasn't the CMS at the time. It is now, but I suspect the CMS isn't part of the issue.

  4. Had the same issue and opened a ticket with MS.
    They saw nothing suspicious, but CPU was around 60% at all times.
    I figured out the Call Park Service consumes up to 70% CPU at times, and the worst part is that it's not being used at all.
    Disabled Call Park in all policies: same issue. Restarted the service: same issue.
    It was only fixed after I manually removed it from "Programs and Features" and stopped the service.
    CPU is now around 10-15% most of the time.

  5. We have had this high CPU spike in the w3wp processes when publishing topology changes, and we had the issue even before we upgraded to 2013. Shortly after installing our first 2013 pool we had the issue again, but it seemed to impact other 2010 servers in the pool as well; it was not restricted to servers hosting the CMS. It does not seem to happen every time: we have made several topology additions without hitting the issue.
    I have just had the issue again after the topology was modified to remove a 2010 pool post-migration. In this instance all four cores on the CMS-hosting server were at 100%, with two of the eight or so w3wp.exe processes consuming the resources between them. One odd detail about this instance: I only noticed it because my remote PowerShell session timed out while connecting, and I only noticed that because I was attempting to access the RGS config page on a 2010 pool. I checked the CPU on the 2010 front end and it was quite low, but there was an RGS error in the Lync event log and an ASP.NET error in the Application event log.

    Mike Dickin
    Hempel

  6. Problem Solved:
    I had this problem on a two-node Enterprise Edition front-end pool. It started on one of the front ends and then spread to the second one. After a lot of investigation, the solution was to apply SQL Server 2012 Express SP1 to both the LYNCLOCAL and RTCLOCAL instances. The LYNCLOCAL instance would not install using the unattended install, but did install using the GUI.
