SQLServerWiki

“The Only Thing That Is Constant Is Change”

Cluster Failover RCA Can be interesting at times!

Posted by database-wiki on July 9, 2012

Actual time of failure, from the system log (saved as text):

Information           1/12/2012 3:40:51 PM       Service Control Manager     7036       None      The WinHTTP Web Proxy Auto-Discovery Service service entered the stopped state.

Information           1/12/2012 3:24:20 PM       Service Control Manager     7036       None      The WinHTTP Web Proxy Auto-Discovery Service service entered the running state.

Information           1/12/2012 3:23:23 PM       Service Control Manager     7036       None      The SQL Server Agent (TEST_SQL2005) service entered the running state.

Came online:

Information           1/12/2012 3:23:17 PM       Service Control Manager     7036       None      The SQL Server (TEST_SQL2005) service entered the running state.

Information           1/12/2012 3:09:58 PM       Service Control Manager     7036       None      The WMI Performance Adapter service entered the stopped state.

Information           1/12/2012 3:06:53 PM       Service Control Manager     7036       None      The WMI Performance Adapter service entered the running state.

Information           1/12/2012 3:06:32 PM       Service Control Manager     7036       None      The SQL Server (TEST_SQL2005) service entered the stopped state.

Error        1/12/2012 3:06:29 PM       Service Control Manager     7034       None      The SQL Server Agent (TEST_SQL2005) service terminated unexpectedly.  It has done this 4 time(s).

Went offline:

Error        1/12/2012 3:05:53 PM       Microsoft-Windows-FailoverClustering            1069       Resource Control Manager Cluster resource ‘SQL Server (TEST_SQL2005)’ in clustered service or application ‘TEST-SQL-01’ failed.

Error        1/12/2012 1:52:25 PM       TermDD  56           None      The Terminal Server security layer detected an error in the protocol stream and has disconnected the client.

Cluster log:

00000aac.000037b4::2012/01/12-13:41:58.791 INFO  [NM] Received request from client address fe80::f1d9:57a6:9ec0:9d21.

00000aac.000014e0::2012/01/12-15:00:18.150 INFO  [NM] Received request from client address 11.300.0.13.

00000f4c.0000277c::2012/01/12-15:05:51.898 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

=>  CheckQueryProcessorAlive – This is the function that checks whether SQL Server responds to simple queries. The query we run is "select @@servername".
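The probe is simple enough to reproduce by hand. This is a sketch: the resource DLL issues it over ODBC, but running it from sqlcmd against the virtual server name approximates the same check.

```sql
-- The IsAlive-style probe: if even this trivial query hangs or times out,
-- the cluster considers the query processor unresponsive.
SELECT @@SERVERNAME;
```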

00000f4c.0000277c::2012/01/12-15:05:51.938 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] printODBCError: sqlstate = HYT00; native error = 0; message = [Microsoft][SQL Server Native Client 10.0]Query time expired

00000f4c.0000277c::2012/01/12-15:05:51.947 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] OnlineThread: QP is not online.

00000f4c.0000277c::2012/01/12-15:05:51.957 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

00000f4c.0000277c::2012/01/12-15:05:51.963 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Server Native Client 10.0]The connection is no longer usable because the server failed to respond to a command cancellation for a previously executed statement in a timely manner. Possible causes include application deadlocks or the server being overloaded. Open a new connection and re-try the operation.

00000f4c.0000277c::2012/01/12-15:05:51.963 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

00000f4c.0000277c::2012/01/12-15:05:51.963 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Server Native Client 10.0]The connection is no longer usable because the server failed to respond to a command cancellation for a previously executed statement in a timely manner. Possible causes include application deadlocks or the server being overloaded. Open a new connection and re-try the operation.

00000f4c.0000277c::2012/01/12-15:05:51.964 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

00000f4c.0000277c::2012/01/12-15:05:51.964 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Server Native Client 10.0]The connection is no longer usable because the server failed to respond to a command cancellation for a previously executed statement in a timely manner. Possible causes include application deadlocks or the server being overloaded. Open a new connection and re-try the operation.

00000f4c.0000277c::2012/01/12-15:05:51.964 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

00000f4c.0000277c::2012/01/12-15:05:51.964 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Server Native Client 10.0]The connection is no longer usable because the server failed to respond to a command cancellation for a previously executed statement in a timely manner. Possible causes include application deadlocks or the server being overloaded. Open a new connection and re-try the operation.

00000f4c.0000277c::2012/01/12-15:05:51.964 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

00000f4c.0000277c::2012/01/12-15:05:51.964 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Server Native Client 10.0]The connection is no longer usable because the server failed to respond to a command cancellation for a previously executed statement in a timely manner. Possible causes include application deadlocks or the server being overloaded. Open a new connection and re-try the operation.

IsAlive check failed. (SQL Server was hung, or was in a state where it could no longer accept new connections.)

00000f4c.000010bc::2012/01/12-15:05:52.713 WARN  [RHS] Resource SQL Server (TEST_SQL2005) IsAlive has indicated failure.

00000aac.00003450::2012/01/12-15:05:52.726 INFO  [RCM] TransitionToState(SQL Server (TEST_SQL2005)) Online–>ProcessingFailure.

00000aac.00003450::2012/01/12-15:05:52.733 ERR   [RCM] rcm::RcmResource::HandleFailure: (SQL Server (TEST_SQL2005))

00000aac.00003450::2012/01/12-15:05:53.287 INFO  [RCM] resource SQL Server (TEST_SQL2005): failure count: 6, restartAction: 0.

00000aac.00003450::2012/01/12-15:05:53.287 INFO  [RCM] TransitionToState(SQL Server (TEST_SQL2005)) ProcessingFailure–>[Terminating to Failed].

00000aac.00003450::2012/01/12-15:05:53.289 INFO  [RCM] rcm::RcmGroup::ProcessStateChange: (TEST-SQL-01, Online –> Pending)

00000aac.00003450::2012/01/12-15:05:53.289 INFO  [RCM] TransitionToState(SQL Server Agent (TEST_SQL2005)) Online–>[Terminating to OfflineDueToProvider].

00000aac.00003450::2012/01/12-15:05:53.300 INFO  [RCM] Will retry online of SQL Server (TEST_SQL2005) in 3600000 milliseconds.

00000f4c.0000277c::2012/01/12-15:06:07.264 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] ODBC sqldriverconnect failed

00000f4c.0000277c::2012/01/12-15:06:07.264 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = 102; message = [Microsoft][SQL Server Native Client 10.0]TCP Provider: Timeout error [258].

00000f4c.0000277c::2012/01/12-15:06:07.264 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] ODBC sqldriverconnect failed

00000f4c.0000277c::2012/01/12-15:06:07.264 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] checkODBCConnectError: sqlstate = HYT00; native error = 0; message = [Microsoft][SQL Server Native Client 10.0]Login timeout expired

00000f4c.0000277c::2012/01/12-15:06:07.264 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] ODBC sqldriverconnect failed

00000f4c.0000277c::2012/01/12-15:06:07.265 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = 102; message = [Microsoft][SQL Server Native Client 10.0]Unable to complete login process due to delay in prelogin response

00000f4c.0000277c::2012/01/12-15:06:07.265 INFO  [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] ConnectToSQL: asked to terminate while trying to connect to server.

=>  ConnectToSQL – the function which connects to the SQL Server resource being managed. We use ODBC for this connection.

=>  sqsrvresCheckServiceAlive – the function which checks whether the SQL Server service is alive.

00000f4c.0000277c::2012/01/12-15:06:07.269 ERR   [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] OnlineThread: Error connecting to SQL Server.

00000f4c.0000277c::2012/01/12-15:06:07.269 INFO  [RES] SQL Server <SQL Server (TEST_SQL2005)>: [sqsrvres] OnlineThread: asked to terminate while waiting for QP.

=>  sqsrvresOnlineThread – the function which brings a resource from the resource table online.

The above function checks whether SQL Server is online. If it is not, it will terminate the SQL Server service along with its dependencies and then restart it or fail it over, depending on the failover settings.

00000aac.000026bc::2012/01/12-15:06:29.651 INFO  [RCM] TransitionToState(SQL Server Agent (TEST_SQL2005)) [Terminating to OfflineDueToProvider]–>OfflineDueToProvider.

00000aac.00003450::2012/01/12-15:06:32.786 INFO  [RCM] TransitionToState(SQL Server (TEST_SQL2005)) [Terminating to Failed]–>Failed.

00000aac.00003450::2012/01/12-15:06:32.786 INFO  [RCM] rcm::RcmGroup::ProcessStateChange: (TEST-SQL-01, Pending –> Failed)

=> These are the errors we saw in the SQL Server ERRORLOG when SQL Server went down.

2012-01-04 12:55:23.75 spid1s      A significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: 0 seconds. Working set (KB): 38836, committed (KB): 89208, memory utilization: 43%.

….

….

2012-01-04 12:55:34.65 spid1s      A significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: 0 seconds. Working set (KB): 133964, committed (KB): 269544, memory utilization: 49%.

….

….

2012-01-12 15:02:18.69 Server      All schedulers on Node 0 appear deadlocked due to a large number of worker threads waiting on CXPACKET. Process Utilization 1%.

2012-01-12 15:03:18.70 Server      All schedulers on Node 0 appear deadlocked due to a large number of worker threads waiting on CXPACKET. Process Utilization 1%.

2012-01-12 15:04:43.71 Server      All schedulers on Node 0 appear deadlocked due to a large number of worker threads waiting on CXPACKET. Process Utilization 2%.

2012-01-12 15:05:43.72 Server      All schedulers on Node 0 appear deadlocked due to a large number of worker threads waiting on CXPACKET. Process Utilization 1%.

2012-01-12 15:06:32.14 spid2s      Error: 9001, Severity: 21, State: 5.

2012-01-12 15:06:32.14 spid2s      The log for database ‘archs’ is not available. Check the event log for related error messages. Resolve any errors and restart the database.

2012-01-12 15:06:32.16 spid2s      SQL Trace was stopped due to server shutdown. Trace ID = ‘1’. This is an informational message only; no user action is required.

2012-01-12 15:06:32.23 Logon       Error: 17188, Severity: 16, State: 1.

2012-01-12 15:06:32.23 Logon       SQL Server cannot accept new connections, because it is shutting down. The connection has been closed. [CLIENT: 10.200.14.11]

2012-01-12 15:06:32.23 Logon       Error: 17188, Severity: 16, State: 1.

2012-01-12 15:06:32.23 Logon       SQL Server cannot accept new connections, because it is shutting down. The connection has been closed. [CLIENT: 10.200.14.11]

The last messages before termination point towards CXPACKET holding lots of threads on the schedulers and causing a server-wide hang.
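When an instance is in this state, a Dedicated Administrator Connection (DAC) will often still get in. A query along these lines (a sketch, not part of the original troubleshooting) shows whether the schedulers have queued work and no free workers:

```sql
-- Per-scheduler view of worker usage. A growing work_queue_count with
-- active_workers_count at or near current_workers_count on every scheduler
-- matches the "all schedulers appear deadlocked" symptom above.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,
       current_workers_count,
       active_workers_count,
       work_queue_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;   -- user schedulers only
```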

Solving the puzzle!!!

Inside the SQL Server instance, lots of threads were held on the schedulers due to CXPACKET waits. (This wait occurs when synchronizing the query processor exchange iterator; you may consider lowering the degree of parallelism if contention on this wait type becomes a problem.) This caused an instance-wide hang or freeze. Once SQL Server

freezes, it cannot accept any new connections, and the IsAlive check from the Windows cluster needs to connect to SQL Server to verify it is up and running. If the IsAlive check fails,

the cluster service terminates the SQL Server service along with its dependents, such as SQL Server Agent, before bringing it online once again or failing it over to the other node (depending on the failover settings).
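To confirm that CXPACKET is in fact the dominant wait rather than an incidental one, the cumulative wait statistics can be checked. This is a sketch against sys.dm_os_wait_stats:

```sql
-- Top waits since the last restart (or since the stats were cleared).
-- CXPACKET near the top supports the parallelism diagnosis.
SELECT TOP 10
       wait_type,
       waiting_tasks_count,
       wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```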

=>  We made the following simple changes, and the instance has been doing just fine since.

#############################################################################

****************** SERVER SETTINGS ********************

#############################################################################

1. The box has multi-core processors, hyperthreading enabled, or both.

Number of physical cores: 2

Number of logical processors: 8

Per KB 322385, MAXDOP may need to be changed to the number of physical processors if we see a lot of waits on CXPACKET.

We suggested setting MAXDOP to 4.

Command:

sp_configure 'show advanced options', 1;

GO

RECONFIGURE;

GO

sp_configure 'max degree of parallelism', 4;

GO

RECONFIGURE;

GO

2. The total physical memory is 18 gigs.

Physical memory on the box: 18429 MB

There were many signs of internal memory pressure on the server. We suggest leaving at least 4 GB for the OS and other processes, and giving 14 GB to SQL Server.

Command:

sp_configure 'max server memory (MB)', 14336;

GO

RECONFIGURE;

GO
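To double-check that both changes took effect (RECONFIGURE applies them without a restart), sys.configurations can be queried; value_in_use should read 4 and 14336 respectively:

```sql
-- Confirm the running values, not just the configured ones.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name IN ('max degree of parallelism', 'max server memory (MB)');
```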

3. We are still on SP2 for SQL Server 2005: Microsoft SQL Server 2005 – 9.00.3042.00 (X64).

=> We need to upgrade the instance to at least SP3, preferably SP4, with the latest cumulative updates.
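The service-pack level can be verified before and after patching with SERVERPROPERTY (a quick sketch):

```sql
-- ProductLevel returns e.g. 'SP2'; ProductVersion returns the build number.
SELECT SERVERPROPERTY('ProductVersion') AS product_version,
       SERVERPROPERTY('ProductLevel')   AS product_level,
       SERVERPROPERTY('Edition')        AS edition;
```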

General recommendation:

4. Look at your NIC configuration on the server and make sure there are no issues (physical card faults, autodetect not picking the fastest speed).

5. Validate the network components between the application/clients and the SQL Server instance (routers, for example). This should be done as per

your application design: Client -> web server (if applicable) -> SQL Server box.

– Update the network-related drivers on the client, the web server (if applicable), and the SQL Server box.

6. Make sure your SAN disk drivers are up to date, as well as your Windows-level patches.

Information overload:

Cluster internals:

Each Cluster resource exists in one of five states: offline, offline pending, online, online pending, and failed. When a resource is online, it is active and available for use. Offline pending and online pending are transitional, internal states that reflect a change from online to offline and vice versa. When MSCS notes a resource’s state as failed, a failover ordinarily occurs; however, you can effect a manual failover from Cluster Administrator. One type of failure detection that is resource-independent occurs when a node or the communications links between nodes fails. Cluster Network Driver detects these problems and notifies Failover Manager and Membership Manager. The affected nodes determine, with the aid of Membership Manager and Resource Manager, whether they should attempt the failover of their resources.

Individual resources can fail, and the detection of individual resource failure is the domain of each resource’s Resource DLL and the Resource Monitor. Resource Monitor pings each resource’s Resource DLL at specified intervals during the configuration of the resource to determine if the resource is functional. Resource DLLs implement two entry points for failure detection: LooksAlive and IsAlive. At a configured polling interval, Resource Monitor sends the Resource DLL a LooksAlive request. Upon this request, the Resource DLL performs a cursory check to see if the resource is OK, and the Resource DLL must respond to Resource Monitor within 150 milliseconds (ms). If the LooksAlive request fails, Resource Monitor will send an IsAlive request to the Resource DLL. To respond to the IsAlive request, the Resource DLL has up to 300ms to determine if the resource has failed. If the Resource DLL answers no to Resource Monitor in response to the IsAlive request, Resource Monitor considers the resource to be offline and signals to Cluster Service to fail over the group that contains the failed resource.

Failure Detection

Failure detection and prevention are key benefits provided by server clusters. When a node or application in a cluster fails, server clusters can respond by restarting the failed application or dispersing the work from the failed system to surviving nodes in the cluster. Server cluster failure detection and prevention includes bi-directional failover, application failover, parallel recovery, and automatic failback.

The cluster service detects failures of individual resources or an entire node and dynamically moves and restarts application, data, and file resources on an available, healthy server in the cluster. This allows resources such as database, file shares, and applications to remain highly available to users and client applications.

Server clusters are designed with two different failure detection mechanisms:

•          Heartbeat for detecting node failures.

•          Resource Monitor and resource DLLs for detecting resource failures.

Detecting Node Failures

Periodically, each node exchanges datagram messages with other nodes in the cluster using the private cluster network. These messages are referred to as the heartbeat. The heartbeat exchange enables each node to check the availability of other nodes and their applications. If a server fails to respond to a heartbeat exchange, the surviving servers initiate failover processes including ownership arbitration for resources and applications owned by the failed server. Arbitration is performed using a challenge and defense protocol.  In other words, the node that appears to have failed is given a time window where it is expected to demonstrate in any of several ways that it is still up and running properly, and can communicate with the surviving nodes.  If it is unable to do that, only then is it removed from the cluster.

Failure to respond to a heartbeat message can be caused by several events, such as computer failure, network interface failure, network failure, or even periods of unusually high activity. Normally, when all nodes are communicating, Configuration Database Manager sends global configuration database updates to each node. However, when a failure on a heartbeat exchange occurs, the Log Manager also saves configuration database changes to the quorum resource. This ensures that surviving nodes can access the most recent cluster configuration and local node registry key data during recovery processes.

Note that the failure detection algorithm is very conservative: in other words, it gives the apparently failed node the benefit of the doubt as much as possible before proceeding with the failover process.  Clearly, if the cause of the heartbeat response failure is temporary, it is best to avoid the potential disruption a failover might cause.  However, there is no way to know whether the node is going to be quiet for another millisecond, another second, or, on the other end of the scale, it has suffered a catastrophic failure.  As a result, after a “reasonable” period of time, a failover is initiated.

Detecting Resource Failures

The Failover Manager and Resource Monitors work together to detect and recover from resource failures. Resource Monitors keep track of resource status by periodically polling resources using the resource DLLs. Polling involves two steps, a cursory LooksAlive query and longer, more definitive, IsAlive query. When the Resource Monitor detects a resource failure, it notifies the Failover Manager and continues to monitor the resource.

The Failover Manager maintains resources and resource group status. It is also responsible for performing recovery when a resource fails and will invoke Resource Monitors in response to user actions or failures.

After a resource failure is detected the Failover Manager can perform recovery actions that include either restarting a resource and its dependent resources, or moving the entire resource group to another node. Which recovery action is performed is determined by resource and resource group properties and node availability.

During failover, the resource group is treated as the failover unit, to ensure that resource dependencies are correctly recovered. Once a resource recovers from a failure, the Resource Monitor notifies the Failover Manager, which can then perform automatic failback of the resource group, based on the configuration of the resource group failback properties.
