Windows NLB and Application Level health check

Recently I came across a scenario that required an active/passive Windows NLB setup. If the active node runs into trouble, all application-related services must be stopped on that node and started on the passive node. To achieve that failover we need a health check. Windows NLB handles network-level failures: the active node loses network connectivity, is shut down, or crashes. In all these cases Windows NLB starts sending connections to the passive node. But what if an application service crashes, hangs, or stops servicing connections? Does Windows NLB offer any option to initiate failover based on an application-level health check?

Windows NLB is a great feature for network load balancing, but it has not evolved much since the Windows 2000 days and it does not address the above questions directly. However, Microsoft does provide some sample script templates for monitoring application-level health, which can be enhanced as required.

I also came across another blog by David Tosoff, who has provided a brilliant script that can be configured for application-level health checks and customized to work with any application behind NLB. What's more, it can also be configured to run as a service.

However, all I needed was a script that checks whether certain services are running on the active node. If any of those services is not running, it should stop all the services on the active node, make it passive, start the services on the passive node, and make that node active. After gathering a few pieces of code from different blogs and forums, I came up with the PowerShell script below to accomplish this task.

In addition to the service health check and failover, this script also provides a basic health check and recovery for the NLB cluster.

########################### NLB Health Check ##############################
#Define Nodes
$node1 = "Node1.Contoso.Com"
$node2 = "Node2.Contoso.Com"
#get NLB status on NLB Nodes
$Node1status = Get-WmiObject -Class MicrosoftNLB_Node -computername $node1 -namespace root\MicrosoftNLB |  where {$_.ComputerName -eq $node1} | Select-Object __Server, statuscode
$Node2status = Get-WmiObject -Class MicrosoftNLB_Node -computername $node2 -namespace root\MicrosoftNLB |  where {$_.ComputerName -eq $node2} | Select-Object __Server, statuscode

Function HealthCheck ([String]$Active, [String]$Passive)
{
#Get all services on the active node
$GetService = Get-Service -ComputerName $Active
#Write-Host "Checking Service on $Active " -ForegroundColor Green
#Names of the services you want to monitor
$ServiceArray = "Service1","Service2","Service3","Service4"
#Find any monitored service that is hung or stopped
foreach ($Service in $GetService)
{
    foreach ($srv in $ServiceArray)
    {
        if ($Service.Name -eq $srv)
        {
            #check if a service is hung
            if ($Service.Status -eq "StopPending")
            {
            #email to notify if a service is hung
            #Send-MailMessage -To admin@domain.com -Subject "$srv is hung on $Active" -From admin@domain.com -Body "The $srv service was found hung" -SmtpServer smtp.domain.com
            #Kill the hung service process on the active node (Stop-Process only works locally, so use WMI)
            $servicePID = (gwmi win32_service -ComputerName $Active -Filter "name='$srv'").ProcessId
            if ($servicePID)
                {
                (gwmi win32_process -ComputerName $Active -Filter "ProcessId=$servicePID").Terminate()
                }
            }
            # check if a service is stopped
            elseif ($Service.Status -eq "Stopped")
            {
            #email to notify if a service is down
            #Send-MailMessage -To admin@domain.com -Subject "$srv is stopped on $Active" -From admin@domain.com -Body "The $srv service was found stopped" -SmtpServer smtp.domain.com
            #Write-Host "$srv is stopped on $Active" -ForegroundColor Red
            if ( Test-Connection -ComputerName $Passive -Count 1 -ErrorAction SilentlyContinue )
                {
                Write-Host "$Passive is up" -ForegroundColor Magenta
                #Call Cleanup Function for Active to Passive
                Cleanup $Active $Passive
                #Failover is done; the service list collected above is now stale, so stop checking
                return
                }
            else
                {
                Write-Host "$Passive is down" -ForegroundColor Red
                #No node to fail over to, so restart the service on the active node
                Start-Service -InputObject (Get-Service -ComputerName $Active -Name $srv)
                }
            }
        }
    }
}
} # End of Function

Function Cleanup ([String]$Stop, [String]$Start)
{
    Import-Module NetworkLoadBalancingClusters
    #Services to move between nodes (keep in sync with $ServiceArray in HealthCheck)
    $services = "Service1","Service2","Service3","Service4"
    #Invoke-Command -ComputerName $Stop -ScriptBlock { cd C:\users\Administrator.Contoso\Desktop; .\Services_stop.cmd}
    #Stop Services on Failed Node
    foreach ($service in $services)
    {
        (gwmi win32_service -ComputerName $Stop -Filter "name='$service'").StopService()
    }
    #Stop Failed NLB Node
    Stop-NlbClusterNode -HostName $Stop -Drain -Timeout 10
    #Invoke-Command -ComputerName $Start -ScriptBlock { cd C:\users\Administrator.Contoso\Desktop; .\Services_start.cmd}
    #Start Services on the node that becomes active
    foreach ($service in $services)
    {
        (gwmi win32_service -ComputerName $Start -Filter "name='$service'").StartService()
    }
    #Start Passive NLB Node
    Start-NlbClusterNode -HostName $Start
    #Other Win32_Service calls kept for reference:
        #$result = (gwmi win32_service -ComputerName $computer -Filter "name='$service'").StartService()
        #$result = (gwmi win32_service -ComputerName $computer -Filter "name='$service'").StopService()
        #$result = (gwmi win32_service -ComputerName $computer -Filter "name='$service'").ChangeStartMode("Disabled")
        #$result = (gwmi win32_service -ComputerName $computer -Filter "name='$service'").ChangeStartMode("Automatic")
} #End of Function
#Start-NlbCluster is called below before Cleanup runs, so load the NLB module here as well
Import-Module NetworkLoadBalancingClusters
If ($Node1status.statuscode -eq 1008 -or $Node1status.statuscode -eq 1007)
{
    Write-Host "NLB Status of $node1 is: Converged" -ForegroundColor Green
    HealthCheck $node1 $node2
}
else
{
    Write-Host "NLB Status of $node1 is: Error" -ForegroundColor Red
    If ($Node2status.statuscode -eq 1008 -or $Node2status.statuscode -eq 1007)
    {
        Write-Host "NLB Status of $node2 is: Converged" -ForegroundColor Green
        #Write-Host "Passing HealthCheck with $node2, $node1" -ForegroundColor Green
        HealthCheck $node2 $node1
    }
    else
    {
        Write-Host "NLB Status of $node2 is: Error" -ForegroundColor Red
        if ( Test-Connection -ComputerName $node1 -Count 1 -ErrorAction SilentlyContinue )
        {
            Write-Host "$node1 is up" -ForegroundColor Magenta
            #Restart the cluster from node1, then move the services from node2 to node1
            Start-NlbCluster -HostName $node1
            Cleanup $node2 $node1
            Start-Sleep -Seconds 30
        }
        else
        {
            Write-Host "$node1 is down" -ForegroundColor Red
            if ( Test-Connection -ComputerName $node2 -Count 1 -ErrorAction SilentlyContinue )
            {
                Write-Host "$node2 is up" -ForegroundColor Magenta
                #Restart the cluster from node2, then move the services from node1 to node2
                #(the stop calls against node1 will report errors while it is unreachable)
                Start-NlbCluster -HostName $node2
                Cleanup $node1 $node2
                Start-Sleep -Seconds 30
            }
            else
            {
                Write-Host "$node2 is down" -ForegroundColor Red
                Write-Host "All nodes in NLB are DOWN !!!!!!!" -ForegroundColor Red
            }
        }
    }
}

########################### NLB Health Check ##############################
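
The script above only tells codes 1007 and 1008 apart from everything else and reports every other state as an error. As a small sketch, the helper below (Get-NlbStatusText is a name I made up) translates the StatusCode values that, as far as I know, the MicrosoftNLB_Node class documents into readable text, so the output can say whether a node is stopped, draining, or still converging. Please verify the list of codes against your Windows version before relying on it.

```powershell
# Hypothetical helper: map a MicrosoftNLB_Node StatusCode to readable text.
# The code list is an assumption based on the WMI class documentation.
function Get-NlbStatusText {
    param([int]$StatusCode)
    switch ($StatusCode) {
        1005    { 'Stopped' }
        1006    { 'Converging' }
        1007    { 'Converged' }
        1008    { 'Converged (default host)' }
        1009    { 'Draining' }
        1013    { 'Suspended' }
        default { "Unknown ($StatusCode)" }
    }
}
```

It could be used in place of the fixed "Error" text, for example: Write-Host "NLB Status of $node1 is: $(Get-NlbStatusText $Node1status.statuscode)".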

Note: I am no expert when it comes to scripting, and this is not a perfect script, but it works. There is a lot of room for improvement, and if you have any suggestions please help me make it better :)

The next step is to run this script continuously to monitor the health of NLB. There are many ways to do it. Schtasks can run your custom script on a schedule, or you can use Instsrv.exe and Srvany.exe, which are part of the Windows Server 2003 Resource Kit Tools, to install it as a service.

I found Task Scheduler a better fit for my scenario. There is a nice post on the TechNet blog that explains the details. If you want the quick version, the command below is the only one you need.

C:\>schtasks /create /tn HealthCheck /tr "powershell -NoLogo -WindowStyle hidden -file NLB_Health_Check.ps1" /sc minute /mo 1 /ru System

Don't forget to change the PowerShell execution policy to Unrestricted before you schedule the script. To do that, run PowerShell as Administrator and run the command:

Set-ExecutionPolicy Unrestricted
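
If you would rather not loosen the execution policy for the whole machine, one alternative sketch is to pass -ExecutionPolicy Bypass to powershell.exe in the task itself, so that only this task skips the policy. C:\Scripts is a hypothetical folder; use the full path where you actually saved the script.

```powershell
# Assumption: the script was saved to C:\Scripts (hypothetical path).
# -ExecutionPolicy Bypass applies to this powershell.exe instance only.
schtasks /create /tn HealthCheck /tr "powershell -NoLogo -WindowStyle Hidden -ExecutionPolicy Bypass -File C:\Scripts\NLB_Health_Check.ps1" /sc minute /mo 1 /ru System
```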

With PowerShell you can extend Windows NLB to host your standard applications with high availability with ease. Have fun :)

Case of crashing wbengine and system state backup…

While going through patch reports I noticed that two Windows Server 2008 R2 SP1 servers had missed two patch cycles. It soon turned out that the system state backup was not running on these servers. No backup, so no patching.

So I started with system state backup.

A simple command, wbadmin start systemstatebackup -backuptarget:c:, gave the following error:

The Windows Backup engine could not be contacted. Retry the operation.
The RPC server is unavailable.

A quick look at Event Viewer revealed more: wbengine.exe was crashing in the ntdll.dll module.

Log Name:      Application
Source:        Application Error
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Description:
Faulting application name: wbengine.exe, version: 6.1.7601.17514, time stamp: 0x4ce79951
Faulting module name: ntdll.dll, version: 6.1.7601.17725, time stamp: 0x4ec4aa8e
Exception code: 0xc0000374
Fault offset: 0x00000000000c40f2

We found that the VSS writers were stable and did not report any errors.
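
For reference, the writer state can be checked from an elevated prompt with vssadmin. A minimal sketch that trims the output down to the interesting lines:

```powershell
# List every VSS writer with its state and last error (run elevated).
vssadmin list writers | Select-String -Pattern 'Writer name', 'State', 'Last error'
```

A healthy writer reports a stable state and no last error.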

%windir%\logs\windowsserverbackup did not reveal any logs.

With no leads, we decided to treat this as a faulting application and crashing DLL scenario and patch both files to the latest versions.

A quick search on support.microsoft.com revealed a couple of fixes matching our scenario:

http://support.microsoft.com/kb/2182466 "2155347997 (0x8078001D)" error code when you perform a system state backup operation in Windows 7 or in Windows Server 2008 R2

http://support.microsoft.com/kb/2512352 Windows Server Backup utility does not back up some newly created files in Windows 7 or in Windows Server 2008 R2

http://support.microsoft.com/kb/2545627 A multithreaded application might crash in Windows 7 or in Windows Server 2008 R2

With several other applications also crashing in ntdll.dll, KB 2545627 was a perfect fit for our server, and KB 2512352 was selected because it was the latest.

After the updates we found that the other applications failing in ntdll.dll were fixed, but it made no difference to the primary issue of the failing backup. We saw the same event 1000, this time with higher file version numbers.

Log Name:      Application
Source:        Application Error
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Description:
Faulting application name: wbengine.exe, version: 6.1.7601.21667, time stamp: 0x4d65d41c
Faulting module name: ntdll.dll, version: 6.1.7601.21861, time stamp: 0x4ec4a6c2
Exception code: 0xc0000374
Fault offset: 0x0000000000c4192

Looking again at %windir%\logs\windowsserverbackup revealed a Wbadmin.etl file which I had missed earlier.

I used tracerpt to analyze the ETL file, but it did not reveal any information about the crash.
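
A typical way to convert such a trace is sketched below. The output file names are my own choice, and the input path assumes the log folder and file name mentioned above.

```powershell
# Convert the binary trace to XML and write a summary next to it.
# wbadmin.xml and wbadmin_summary.txt are arbitrary output names; -y overwrites without prompting.
tracerpt "$env:windir\Logs\WindowsServerBackup\Wbadmin.etl" -o wbadmin.xml -of XML -summary wbadmin_summary.txt -y
```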

In the meantime the onsite team also ran sfc /scannow and reinstalled the Windows Backup feature, but it did not make any difference.

Some more searching on the TechNet forums pointed to "Manage Engine Asset Explorer Agent" as a possible cause. We had this agent installed on this server.

A quick check of the agent's install date against the date of the last successful backup confirmed the same.

We uninstalled "Manage Engine Asset Explorer Agent" and were relieved to see that wbengine was not crashing anymore :)

However, this time backup failed at enumeration of files,

Summary of backup:
------------------
Backup of system state failed [date time]

Log of files successfully backed up
'C:\Windows\Logs\WindowsServerBackup\SystemStateBackup date time.log'

Log of files for which backup failed
'C:\Windows\Logs\WindowsServerBackup\SystemStateBackup_Error date time.log'

I found the following event, but it did not help much:

Event ID: 519
Description: The backup operation that started at "Time" has failed to back up volume(s) . Please review the event details for a solution, and then rerun the backup operation once the issue is resolved.

Interestingly, running the system state backup from the Windows Server Backup GUI revealed a more detailed error:

Event ID: 517
Description: The backup operation that started at "Time" has failed with following error code '2155347997'. Please review the event details for a solution, and then rerun the backup operation once the issue is resolved.

KB http://support.microsoft.com/kb/2182466, "2155347997 (0x8078001D)" error code when you perform a system state backup operation in Windows 7 or in Windows Server 2008 R2, refers to the exact same issue. However, we had already installed a higher version of wbengine, so this article was no longer applicable to us.

I also found an interesting article, http://networkadminkb.com/KB/a467/how-to-fix-windows-2008-r2-system-state-backup-fails.aspx, which refers to the same issue for an OS virtualized on VMware ESX.

As the backup was failing while enumerating files, I followed http://blogs.technet.com/b/askcore/archive/2010/06/18/reasons-why-the-error-enumeration-of-the-files-failed-may-occur-during-system-state-backup.aspx (Reasons why the error Enumeration of the files failed may occur during System State backup).

Checking every ImagePath for a correct value is a pain in itself, and it is further complicated by multiple valid syntaxes. Thanks to Tom Acker for providing a nice and easy way to find invalid image paths with the GetInvalidImagePath script.

Running this script revealed multiple image paths containing spaces, which needed to be enclosed in quotes, and a few more keys with an incorrectly added forward slash "/" in the image path.
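
To illustrate the two problems found here, below is a rough sketch of the kind of check involved. This is not the GetInvalidImagePath script, only a simplified heuristic of my own: Test-ImagePath and Find-SuspectImagePath are hypothetical names, and the check only looks at the part of the value up to the first .exe, .sys, or .dll, so that switches such as /service after the file name are not flagged.

```powershell
# Hypothetical heuristic: return a description of the problems found in one ImagePath value,
# or an empty string if nothing looks wrong.
function Test-ImagePath {
    param([string]$ImagePath)

    # Treat everything up to the first executable or driver extension as the path part.
    if ($ImagePath -match '^(?<path>.*?\.(exe|sys|dll))') {
        $path = $Matches['path']
    } else {
        $path = $ImagePath
    }

    $problems = @()
    if ($path.Contains('/')) {
        $problems += 'forward slash in path'
    }
    if ($path.Contains(' ') -and -not $path.StartsWith('"')) {
        $problems += 'unquoted path with spaces'
    }
    return ($problems -join '; ')
}

# Hypothetical wrapper: scan every service key on the local machine and list suspect values.
function Find-SuspectImagePath {
    Get-ChildItem 'HKLM:\SYSTEM\CurrentControlSet\Services' | ForEach-Object {
        $value = (Get-ItemProperty -Path $_.PSPath -Name ImagePath -ErrorAction SilentlyContinue).ImagePath
        if ($value) {
            $issue = Test-ImagePath $value
            if ($issue) { '{0}: {1} ({2})' -f $_.PSChildName, $value, $issue }
        }
    }
}
```

Running Find-SuspectImagePath on the affected server prints one line per suspicious service key. Treat the output as a list of candidates to review by hand, since a heuristic like this can miss or misreport some of the valid syntaxes.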

Once the image paths were cleaned up, the system state backup worked like a charm :)

With a valid backup available, these servers are now good to receive the long-awaited patches missed in the previous and current cycles.