Windows NLB and Application Level health check

Recently I came across a scenario where requirement was having Active Passive windows NLB.  So, if Active node experiences issue we should ensure that all the Application related services are stopped on that node and these services are started on passive node. To achieve the failover we need the health check. Windows NLB provides support for all network failures such as active node looses network connectivity, is shut down or crash. In all these cases windows NLB will start sending connections to passive node. However, what if application related service crashes, hangs or stops servicing the connections? Does windows NLB offers any options to initiate failover options with Application level health check?

Windows NLB is great feature for network load balancing, it has not evolved much since windows 2000 days and it does not address above questions directly. However, Microsoft does provide some sample script templates for Monitoring Application Level Health which can be enhanced as required.

I also came across another blog by DAVID TOSOFF, where he has provided a brilliant script which can be configured for application level health check and can be customized to work with any application for NLB. What more is it can also be configured to run as a service.

However, all i needed is to have a script which will check if certain services are running on Active node. If any of the service is not running, stop all services on Active Node, make it passive, start services or passive node and make it active. After looking for few pieces of code from different blogs and forums i came up with below Power shell script to accomplish this task.

In addition to the service health check and initiating failover, this script also provides basic health check and recovery for NLB cluster.

########################### NLB Health Check ##############################
#Define Nodes
$node1 = “Node1.Contoso.Com”
$node2 = “Node2.Contoso.Com”
#get NLB status on NLB Nodes
$Node1status = Get-WmiObject -Class MicrosoftNLB_Node -computername $node1 -namespace root\MicrosoftNLB |  where {$_.ComputerName -eq $node1} | Select-Object __Server, statuscode
$Node2status = Get-WmiObject -Class MicrosoftNLB_Node -computername $node2 -namespace root\MicrosoftNLB |  where {$_.ComputerName -eq $node2} | Select-Object __Server, statuscode

Function HealthCheck ([String]$Active, [String]$Passive)
{
#Create an array of all services running
$GetService = get-service -ComputerName $Active
#Write-Host “Checking Service on $Active ” -ForegroundColor Green
#Create a subset of the previous array for services you want to monitor
$ServiceArray = “Service1″,”Service2″,”Service3″,”Service4”;
#Find any iWFM service that is stopped
foreach ($Service in $GetService)
{
    foreach ($srv in $ServiceArray)
    {
        if ($Service.name -eq $srv)
        {
            #check if a service is hung
            if ($Service.status -eq “StopPending”)
            {
            #email to notify if a service is down
            #Send-Mailmessage -to admin@domain.com -Subject “$srv is hung on $Active” -from admin@domain.com -Body “The $srv service was found hung” -SmtpServer smtp.domain.com
            $servicePID = (gwmi win32_Service | where { $_.Name -eq $srv}).ProcessID
            Stop-Process $ServicePID
            }
            # check if a service is stopped
            elseif ($Service.status -eq “Stopped”)
            {
            #email to notify if a service is down
            #Send-Mailmessage -to admin@domain.com -Subject “$srv is stopped on $Active” -from admin@domain.com -Body “The $srv service was found stopped” -SmtpServer smtp.domain.com
            #Write-Host “$srv is stopped on $Active” -ForegroundColor Red
            if ( Test-Connection -ComputerName $Passive -Count 1 -ErrorAction SilentlyContinue )
                {
                Write-Host “$Passive is up” -ForegroundColor Magenta
                #Call Cleanup Function for Active to Passive
                Cleanup $Active $Passive
                }
            else
                {
                Write-Host “$Passive is down” -ForegroundColor Red
                #automatically restart the service.
                Start-Service -InputObject (get-Service -ComputerName $Active -Name $srv)
                }
            }
        }
    }
}
} # End of Function

Function Cleanup ([String]$Stop, [String]$Start)
{
    Import-Module NetworkLoadBalancingClusters
    #$services = “Service1″,”Service2″,”Service3″,”Service4”;
    #Invoke-Command -ComputerName $Stop -ScriptBlock { cd C:\users\Administrator.Contoso\Desktop; .\Services_stop.cmd}
    #Stop Services on Failed Node
        (gwmi win32_service -computername $Stop -filter “name=’Service1′”).stopservice()
        (gwmi win32_service -computername $Stop -filter “name=’Service2′”).stopservice()
        (gwmi win32_service -computername $Stop -filter “name=’Service3′”).stopservice()
        (gwmi win32_service -computername $Stop -filter “name=’Service4′”).stopservice()
    #Stop Failed NLB Node
    Stop-NlbClusterNode -HostName $Stop -Drain -Timeout 10
    #Invoke-Command -ComputerName $Start -ScriptBlock { cd C:\users\Administrator.Contoso\Desktop; .\Services_start.cmd}
    #Start Service on Active Node
        (gwmi win32_service -computername $Start -filter “name=’Service1′”).startservice()
        (gwmi win32_service -computername $Start -filter “name=’Service2′”).startservice()
        (gwmi win32_service -computername $Start -filter “name=’Service3′”).startservice()
        (gwmi win32_service -computername $Start -filter “name=’Service4′”).startservice()
    #Start Passive NLB Node
    Start-NlbClusterNode -HostName $Start
        #$result = (gwmi win32_service -computername $computer -filter “name=’$service'”).startservice()}
        #$result = (gwmi win32_service -computername $computer -filter “name=’$service'”).stopservice()
        #$result = (gwmi win32_service -computername $computer -filter “name=’$service'”).ChangeStartMode(“Disabled”)
        #$result = (gwmi win32_service -computername $computer -filter “name=’$service'”).ChangeStartMode(“Automatic”)
} #End of Function
IF ($node1status.statuscode -eq “1008” -or $node1status.statuscode -eq “1007”)
{
    write-host “NLB Status of $node1 is: Converged”  -ForegroundColor Green
    HealthCheck $node1 $node2
}
else
{
    write-host “NLB Status of $node1 is: Error”  -ForegroundColor Red
    IF ($node2status.statuscode -eq “1008” -or $node2status.statuscode -eq “1007”)
    {
    write-host “NLB Status of $node2 is: Converged”  -ForegroundColor Green
    #Write-Host “Passing HealthCheck with $node2, $node1” -ForegroundColor Green
    HealthCheck $node2 $node1
    }
    else
    {
    write-host “NLB Status of $node2 is: Error”  -ForegroundColor Red
        if ( Test-Connection -ComputerName $Node1 -Count 1 -ErrorAction SilentlyContinue )
                {
                Write-Host “$Node1 is up” -ForegroundColor Magenta
                #Call Cleanup Function for Active to Passive
                Start-NlbCluster -HostName Node1
                Cleanup $node2 $node1
                start-sleep -seconds 30
                }
            else
                {
                Write-Host “$Node1 is down” -ForegroundColor Red
                if ( Test-Connection -ComputerName $Node2 -Count 1 -ErrorAction SilentlyContinue )
                    {
                    Write-Host “$Node2 is up” -ForegroundColor Magenta
                    #Call Cleanup Function for Active to Passive
                    Start-NlbCluster -HostName Node2
                    Cleanup $node2 $node1
                    start-sleep -seconds 30
                    }
                else
                    {
                    Write-Host “$Node2 is down” -ForegroundColor Red
                    Write-Host “All nodes in NLB are DOWN !!!!!!!” -ForegroundColor Red
                    }
                }
    }
}

########################### NLB Health Check ##############################

Note: I am no expert when it comes to scripting and this is not a perfect script but it works. There is a lot of room for improvement and if you have any suggestions please help me make it better:)

Next thing is to run this script as a service to monitor the health of NLB. There are many ways to do it. Schtasks is a great utility to install your custom scripts as a service or you can use Instsrv.exe and Srvany.exe which are part of Windows Server 2003 Resource Kit Tools.

However, i found task scheduler a better fit for my scenario. There is a nice post on TechNet blog which explains the details of it. However if you want a quick version, below is the only command that you would need.

C:\schtasks /create /tn HealthCheck /tr “powershell -NoLogo -WindowStyle hidden -file NLB_Health_Check.ps1” /sc minute /mo 1 /ru System

Don’t forget to change the execution policy of power shell to unrestricted before you schedule the script. To do that run power shell as Administrator and run command,

Set-ExecutionPolicy – Unrestricted

With power shell you can enhance the power of windows NLB to host your standard applications for high availability with ease, Have Fun 🙂

Advertisements

Case of crashing wbengine and system state backup…

While going through patch reports I noticed that 2 windows 2008 r2 sp2 servers had missed 2 patch cycles. Soon it was found that system state backup was not happening for these servers. No backup so no patching.

So I started with system state backup.

A simple command, wbadmin start systemstatebackup –backuptarget:c: gave following error,

The Windows Backup engine could not be contacted. Retry the operation.
The RPC server is unavailable.

cmd

Quick look at event viewer revealed more, wbengine.exe crashing with ntdll.dll module

Log Name:      Application
Source:        Application Error
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Description:
Faulting application name: wbengine.exe, version: 6.1.7601.17514, time stamp: 0x4ce79951
Faulting module name: ntdll.dll, version: 6.1.7601.17725, time stamp: 0x4ec4aa8e
Exception code: 0xc0000374
Fault offset: 0x00000000000c40f2

1000-1

We found VSS writers were stable and did not report any errors

%windir%\logs\windowsserverbackup did not reveal any logs

With no leads, we decided to treat this as faulting application and crashing dll scenario and patch both these files to latest.

Quick search on support.microsoft.com revealed couple of fixes matching our scenario

http://support.microsoft.com/kb/2182466 “2155347997 (0x8078001D)” error code when you perform a system state backup operation in Windows 7 or in Windows Server 2008 R2

http://support.microsoft.com/kb/2512352 Windows Server Backup utility does not back up some newly created files in Windows 7 or in Windows Server 2008 R2

http://support.microsoft.com/kb/2545627 A multithreaded application might crash in Windows 7 or in Windows Server 2008 R2

With several other application crashing on ntdll.dll KB 2545627 was perfect fit for our server and being the latest KB 2512352 was selected.

image

After the updated we found issue with other apps failing on ntdll.dll was fixed but it made no difference to primary issues of failing backup. We noticed same event 1000, this time with higher DL version numbers.

Log Name:      Application
Source:        Application Error
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Description:
Faulting application name: wbengine.exe, version: 6.1.7601.21667, time stamp: 0x4d65d41c
Faulting module name: ntdll.dll, version: 6.1.7601.21861, time stamp: 0x4ec4a6c2
Exception code: 0xc0000374
Fault offset: 0x0000000000c4192

1000-2

Looking again at %windir%\logs\windowsserverbackup revealed Wbadmin.etl file which i had missed earlier.

I used tracerpt to analyze etl file but it did not revealed any information about crash issue.

In the mean time onsite team also did sfc /scannow and reinstalled windows backup module but it did not made any difference.

Some more search on TechNet forum pointed out “Manage Engine Asset Explorer Agent” as possible cause. We had this egent installed on this server.

Quick check at installed date of this Agent and last successful backup confirmed that the same.

We uninstalled “Manage Engine Asset Explorer Agent” and were relieved the see that wbengine was not crashing anymore 🙂

However, this time backup failed at enumeration of files,

Summary of backup:
——————
Backup of system state failed [date time]

Log of files successfully backed up
‘C:\Windows\Logs\WindowsServerBackup\SystemStateBackup date time.log’

Log of files for which backup failed
‘C:\Windows\Logs\WindowsServerBackup\SystemStateBackup_Error date time.log’

I found following event, but it did not helped much,

Event ID: 519
Description: The backup operation that started at “Time” has failed to back up volume(s) . Please review the event details for a solution, and then rerun the backup operation once the issue is resolved.

image

Quite interestingly, running the system state backup for GUI via backup module revealed more detailed error,

Event ID: 517
Description: The backup operation that started at “Time” has failed with following error code ‘2155347997’. Please review the event details for a solution, and then rerun the backup operation once the issue is resolved.

image

KB http://support.microsoft.com/kb/2182466 ““2155347997 (0x8078001D)” error code when you perform a system state backup operation in Windows 7 or in Windows Server 2008 R2” reffer to the exact same issue. However we already had installed higher version of wbengine and this articles was not applicable for us anymore.

I also found interesting article http://networkadminkb.com/KB/a467/how-to-fix-windows-2008-r2-system-state-backup-fails.aspx which refers to the same issue for OS virtualized using VMWare ESX.

As the backup was failing during enumerating files i followed http://blogs.technet.com/b/askcore/archive/2010/06/18/reasons-why-the-error-enumeration-of-the-files-failed-may-occur-during-system-state-backup.aspx Reasons why the error Enumeration of the files failed may occur during System State backup.

Checking all Image Paths for correct value is pain in itself, and its further complicated by multiple valid syntaxes. Thanks to Tom Acker  for proving a nice and easy way to find invalid image paths with GetInvalidImagePath script.

Running this script revealed multiple image paths with space which needed to be enclosed in quotes and few more keys with incorrectly added forward slash “/” in image paths.

Once Image Paths were cleaned, system state worked like a charm 🙂

With valid backup available, now these servers are good the receive long awaited patches missed for previous and current cycle.

Case of Missing RAM from SQL server and AWE

Recently I came across any issue with Windows 2008 R2 server with high memory utilization. This server was hosting a custom monitoring tool and it was not servicing runtime reporting request to do performance hit.

The Server had 8 CPU cores and 12 GB ram. CPU utilization was in check however RAM utilization was above 95% consistently. Server owner informed that they are usually forced to reboot the box to get memory utilization under control and generally after 2-3 days uptime memory spikes again to 95-100% and never goes down. First look at  the task manager revealed that memory utilization above 95% however, total memory consumed by processes under process tab was approx 1.5 GB which is less than 13%.

So we started with case of a missing ram. Looking at installed products list, it mentioned SQL 2008 R2 SP2. Off course the usual suspect was SQL. Databases largely  follow Linux memory policy, “Free memory is wasted memory”. Like any other database product SQL has the tendency to occupy free memory as required.

For performance issues there are multiple tools available out there but for Advanced memory utilization analysis RAMMAP from Sysinternals is best choice. RAMMAP revealed almost 10 GB was occupied by AWE????

Address Windowing Extensions or AWE is windows memory management functions which is used to allow more than 3GB memory to standard 32 bit application. Using AWE for SQL was great option on 32 bit OS with high amount of RAM. But we are running windows x64, where AWE should not have been used.

SQL memory utilization settings showed a different picture all together,

image

Although SQL is set to default value of allow maximum available memory to be utilized “Use AWE to allocate memory” was unchecked. SQL was still our primary suspect and to isolate SQL we took the downtime for application and stopped SQL service. Indeed SQL was the culprit, as immediately AWE utilization was clear and total memory utilized on server was less than 20%. Remember the saying, Things are not always what they look like!

image

We changed the maximum server memory settings for SQL to be 8 GB and started up the services. This time SQL had its max 8 GB and our monitoring APP had sufficient breathing space for all the data collection and reporting 🙂

However, I was wondering on why did i never saw this issue on my test servers? It turns out that AWE cannot be used by any account. Its control by GPO setting “Lock pages in Memory” option.

Capture

If you configure a user account to run SQL services, by default no user has rights for Lock Pages in memory settings and SQL wont be able to use AWE settings. In our case SQL service was running under Local System account which by default has the rights for using AWE.