A while ago, the Cluster service began restarting randomly on our Hyper-V hosts, and the only events logged were:
An internal Cluster service operation exceeded the defined threshold of ‘110’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
Neither of them was helpful at that point. RHS is something I have covered in previous posts, and the other event only describes the restart itself, so something else was causing the Cluster service to stop randomly.
There was also an event logged for the Global Update Manager. After reading up on it, I checked with a Microsoft Premier engineer, who confirmed that the Cluster service could stop because of all the information requests coming from the VMM and SCOM agents.
When a state change occurs, such as a cluster resource being taken offline, the nodes in a failover cluster must be notified of the change and acknowledge it before the cluster commits the change to the database. The Global Update Manager is responsible for managing these cluster database updates.
What we had to do was change DatabaseReadWriteMode from 1 (the default) to 0 using PowerShell: (Get-Cluster).DatabaseReadWriteMode = 0
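For anyone who wants to check the value before changing it, here is a slightly longer sketch of the same change. It assumes that it runs in an elevated PowerShell session on one of the cluster nodes and that the FailoverClusters module is installed; the variable name is my own.

```powershell
# Assumption: the FailoverClusters module is available on this machine.
Import-Module FailoverClusters

# Run on a cluster node, or add -Name <cluster> to target a remote cluster.
$cluster = Get-Cluster

# Show the current Global Update Manager mode before touching anything.
"Current DatabaseReadWriteMode: $($cluster.DatabaseReadWriteMode)"

# 0 = write to all nodes, read locally; 1 = majority read and write (Hyper-V default).
$cluster.DatabaseReadWriteMode = 0

# Query the cluster again to confirm that the new value was stored.
(Get-Cluster).DatabaseReadWriteMode
```

Setting the property back to 1 in the same way should restore the Hyper-V default if the change needs to be reverted.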
DatabaseReadWriteMode = 0 means that the cluster is operating in synchronous mode: a Write to the cluster database needs to be committed on all nodes before any of them can move forward. Because of that, each node is guaranteed to have the most up-to-date view of the cluster database, so a Read can be answered by the local node alone.
DatabaseReadWriteMode = 1 (the default mode for Hyper-V clusters) means that the cluster is operating in asynchronous mode. A Write only needs to be committed to a majority of nodes before the cluster moves forward. Because a node is then not guaranteed to have the most current view, a Read request cannot be answered locally: the node first has to confirm with a majority of nodes that its data is accurate.
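To make the difference between the two modes concrete, here is a small illustrative sketch that counts how many nodes take part in a Write and in a Read for a given cluster size. Get-GumAckCount is a hypothetical helper of my own, not a FailoverClusters cmdlet, and counting the local node as part of the majority is an assumption on my side.

```powershell
# Illustrative helper only; this is NOT part of the FailoverClusters module.
function Get-GumAckCount {
    param(
        [Parameter(Mandatory)] [int] $NodeCount,
        [Parameter(Mandatory)] [ValidateSet(0, 1)] [int] $Mode
    )

    # A majority is more than half of the nodes (assumed to include the local node).
    $majority = [int][math]::Floor($NodeCount / 2) + 1

    if ($Mode -eq 0) {
        # Synchronous: every node commits the write, so a local read is enough.
        $write = $NodeCount
        $read  = 1
    }
    else {
        # Asynchronous: a majority commits the write and a majority confirms each read.
        $write = $majority
        $read  = $majority
    }

    [pscustomobject]@{ Mode = $Mode; WriteNodes = $write; ReadNodes = $read }
}

# Compare both modes for a five-node cluster.
Get-GumAckCount -NodeCount 5 -Mode 0
Get-GumAckCount -NodeCount 5 -Mode 1
```

With many read requests coming in, as in our case with the VMM and SCOM agents, mode 1 makes every one of those reads involve several nodes, while mode 0 keeps reads local and makes writes more expensive instead.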
You can read more about the Global Update Manager here