Recently I was troubleshooting a Hyper-V cluster and came across a good article about how failover clustering recovers from unresponsive resources. I wanted to explain a little about how failover clustering communicates with cluster resources and how it detects and recovers when something goes wrong.
For each clustered virtual machine, a cluster “Virtual Machine” resource is created that controls the VM. This cluster resource and its resource DLL communicate with the Virtual Machine Management Service (VMMS), which tells the VM when to start and stop, and they also check the virtual machine's state. All these resources run in a failover cluster component called the Resource Hosting Subsystem (RHS).
RHS is responsible for making sure that cluster resources are working properly by constantly running health checks. It performs two types of health checks: LooksAlive, a quick, light check that by default happens every 5 seconds, and IsAlive, which by default happens every 60 seconds and is also run as a second check whenever LooksAlive fails. If the IsAlive check fails, the resource is considered failed.
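The two-level check described above can be modelled with a small Python sketch. This is only a toy illustration, not how RHS is implemented: `ToyResource`, `run_health_check` and the `"online"` / `"failed"` / `"not-due"` labels are hypothetical names, and the only values taken from the text are the 5-second and 60-second intervals.

```python
from dataclasses import dataclass
from typing import Callable

LOOKS_ALIVE_INTERVAL_S = 5   # quick check interval from the text
IS_ALIVE_INTERVAL_S = 60     # second check interval from the text


@dataclass
class ToyResource:
    """Hypothetical stand-in for a cluster resource; not a real cluster API."""
    name: str
    looks_alive: Callable[[], bool]  # quick, light check
    is_alive: Callable[[], bool]     # second check


def run_health_check(resource: ToyResource, tick_s: int) -> str:
    """Return 'online', 'failed' or 'not-due' for the given second."""
    if tick_s % IS_ALIVE_INTERVAL_S == 0:
        # IsAlive is due on its own schedule.
        return "online" if resource.is_alive() else "failed"
    if tick_s % LOOKS_ALIVE_INTERVAL_S == 0:
        if resource.looks_alive():
            return "online"
        # A failed LooksAlive is followed by IsAlive before declaring failure.
        return "online" if resource.is_alive() else "failed"
    return "not-due"


healthy = ToyResource("vm-a", lambda: True, lambda: True)
hung = ToyResource("vm-b", lambda: False, lambda: False)
print(run_health_check(healthy, 5))  # online
print(run_health_check(hung, 5))     # failed
```

In this model a resource is only reported as failed once IsAlive itself returns false, which mirrors the rule that a failed LooksAlive alone does not mark the resource as failed.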
RHS waits 5 minutes for a resource to answer, a value that is configured with the DeadlockTimeout property. If the resource does not answer in that time, RHS needs to take a recovery action to get it back up and running: the RHS process is terminated and restarted, which causes the resource to restart as well. The Event IDs related to these actions are 1230 “Cluster resource ‘Resource Name’ (resource type ‘Resource Type Name’, DLL ‘DLL Name’) did not respond to a request in a timely fashion. Cluster health detection will attempt to automatically recover by terminating the Resource Hosting Subsystem (RHS) process running this resource.” and 1146 “The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource.”
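The timeout decision described above can be expressed as a minimal Python sketch. It is an illustration under assumptions, not the actual RHS logic: `recovery_action` and its return labels are hypothetical, the model works in seconds for readability, and treating the exact 5-minute mark as already timed out is a choice made for this sketch.

```python
DEADLOCK_TIMEOUT_S = 5 * 60  # the 5-minute wait from the text, in seconds


def recovery_action(call_started_s: float, now_s: float, answered: bool,
                    timeout_s: float = DEADLOCK_TIMEOUT_S) -> str:
    """Decide what this toy monitor does about one outstanding resource call."""
    if answered:
        # The resource responded, so nothing needs to be recovered.
        return "none"
    if now_s - call_started_s >= timeout_s:
        # No answer within the timeout: the hosting process is terminated
        # and restarted, which restarts the resource as well.
        return "terminate-and-restart-rhs"
    return "keep-waiting"


print(recovery_action(0, 120, answered=False))  # keep-waiting
print(recovery_action(0, 300, answered=False))  # terminate-and-restart-rhs
```

The `"terminate-and-restart-rhs"` outcome corresponds to the point at which events 1230 and 1146 would be expected in the log.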
Elden Christensen has a great, more detailed blog post on how failover clustering recovers from unresponsive resources.