Sunday, July 12, 2009

Windows 2003 Cluster: SQL Node Not able to Join Cluster after Changing MaxTokenSize

As per Cannot Generate SSPI Context article, Link, I had changed the MaxTokenSize to fix the issue. I had read various articles before jumping into this change as being the only proper solution to the problem.

However after changing the setting and rebooting the passive node the node refused to join the cluster; looking at the error logs I have messages like …

  • Unable to get join version data from sponsor xxx.xxx.xxx.xxx using NTLM package, status 5.
  • Unable to connect to any sponsor node.
  • Failed to join cluster, status 53.
  • Physical Disk : [DiskArb] Failed to read (sector 12), error 170.
Researching on net I could not figure out results; first article I ran into was KB886717, [1]. It suggested the issue might be because the C:\Windows\Cluster folder was over size of 10MB; which in this case it was. So I removed the log file and tried to restart services with no luck.

I started reading through the log file and started to pick through error messages and looked and articles [2], [3], [5], [6], and [7]. I verified all the settings each article suggested, Group Policies, Security, firewall, etc. None of it seems to help in making sure the passive node would join the cluster successfully.

I started trouble ticket with Microsoft; they found few settings, on of being NTML Compatibility Level that needed changing as part of their troubleshooting but even after changing these settings we were still getting NTLM, state 5 error messages in Cluster.log file (State 5 means permissions denied). Talking to Microsoft they referenced few more KB articles in addition to what I had found already like [8], [9], [10], & [11] that indicated what “might” be the issue; but none of them seem to help resolve our issue.

This whole time we had not rebooted the active node as it was working successfully; but since we were hitting stone-wall every turn we decided to further troubleshoot the issue node 1 (active node) must be restarted because the errors that were being generated on Quorum disk. After rebooting Active Node, the Passive Node came active and Clustering was working successfully.

I had not read any KB article indicating the issue with MaxTokenSize and Windows Clustering, and neither had the Microsoft guys. So talking to the Kerberos experts we figured issue was similar to [7], in which if you change password or the password length is less then 15 characters of the Cluster Services account permissions or security settings are not properly hashed and generates errors when authenticating the new node to the cluster.

So if you are changing the MaxTokenSize setting on SQL Server and it is a cluster please make sure you change it on EVERY NODE; or you will have lots of strange issues that probably shouldn’t exist.

Reference Links:
[1] KB886717 Issue with Cluster Log file, Link.
[2] Problems with Microsoft Clusters, Link.
[3] How to manually re-create the Cluster service account, Link.
[4] A Windows Server 2003 based-computer that is running the Cluster service may be unable to join a cluster after the computer is first restarted, Link.
[5] Cluster Service May Not Start After You Restrict Available IP Ports for Remote Procedure Call, Link.
[6] Ask Core!, Troubleshooting Cluster Logs 101 Why did the resource failover to other node?, Link.
[7] Cluster service account password must be set to 15 or more characters if the NoLMHash policy is enabled, Link.
[8] You cannot add an additional node to a Windows Server 2003-based server cluster, and error code "0x8007042b" is logged in the ClCfgSrv.log file, Link.
[9] You receive an "Error 0x8007042b" error message when you add or join a node to a cluster if you use NTLM version 2 in Windows Server 2003, Link.
[10] How to enable NTLM 2 authentication, Link.
[11] Cluster service does not start on joining node in Windows 2000 Cluster, Link.

SSPI Context Cannot be generated

I am sure everyone has seen this error there are many articles on MSDN, newsgroups, etc.; but I ran into very interesting problem with these that none of the normal articles helped. Articles like [1], [2], [3], and [4] talk about various settings that can generated that error, from being SPN (Service Principal Name), to SQL Protocol Issues, to TDS packet issue, but none of these helped me resolve the issue. As I was getting this issue with one user only; unable to solve it I kept looking around and ran across article [5]; which indicated this is an issue with Kerberos authentication. Which was indicated by previous articles also, but what was interesting it stated that the Token generated by the user account in question was larger then the default size of 12,000 bytes thus the issue.

I had no way to check how many groups the user was in, so working with Active Directory guy we created copy of account and removed one group at a time until the user was able to login successfully. So in a sense we verified that number of groups the user belonged to was causing us the issue. But I did no have a work around for this, as article [5] suggested I was thinking of changing MaxTokenSize to FFFF (65535). I needed to verify that was the issue for sure; as it was only one user who was reported the issue. I had access to far more groups and resources but did not have any issues. After researching more I found Microsoft Utility, TokenSz [6] & [7] that lets you compute the token size generated for Kerberos authentication.

Running TokenSz utility on users account and my account we got interesting information …

Users account
Name: Kerberos Comment: Microsoft Kerberos V1.0
Current PackageInfo->MaxToken: 65535

Using user to user
QueryKeyInfo:
Signature algorithm =
Encrypt algorithm = RSADSI RC4-HMAC
KeySize = 128
Flags = 2083e
Signature Algorithm = -138
Encrypt Algorithm = 23
Start:7/12/2009 7:47:09
Expiry:7/12/2009 17:47:09
Current Time: 7/12/2009 7:47:09
MaxToken (complete context) 13383

If you notice users max token was above the default allowed of 12,000; so SQL Server failed in the authentication. I checked my token size it was only 7479. Looking at how the token is generated in [5]; it includes information from nested groups. So this had major impact on the number of groups she was really in.

Solution? We decided to change the MaxTokenSize on our SQL Server.

This caused another interesting (another word for annoying :S) issue, Link. But after all issues were resolved user was able to log on successfully.

Reference Links:
[1] How to troubleshoot the "Cannot generate SSPI context" error message, Link.
[2] PRB: Error Message: Cannot Generate SSPI Context, Link.
[3] SQL Protocols: "Cannot generate SSPI Context" Error message, when connecting to local SQL Server, Link & Link.
[4] How to configure an SPN for SQL Server Databases Servers, Link.
[5] New Resolution for Problems with Kerberos Authentication when users belong to many groups, Link.
[6] How do Token Size Affect SQL Server, Link.
[7] TokenSz Utility Download, Link.

Friday, June 12, 2009

Default Backup Path Error

I had blogged about how to set default backups for SQL Server 2005 (Link) and SQL Server 2008 (Link); but I recently ran into an error.

Those registry settings were very helpful for me because I do most backups or restores for databases using T-SQL; but this one time I got lazy and decide to use the GUI interface and got following error when trying to enter the file name.

TITLE: Locate Backup File - ServerName
------------------------------

\\UNCPathServerName\HiddenShare$\ServerName
Cannot access the specified path or file on the server. Verify that you have the necessary security privileges and that the path or file exists.

If you know that the service account can access a specific file, type in the full path for the file in the File Name control in the Locate dialog box.

------------------------------
BUTTONS:
OK
------------------------------


So please be cautious when changing the registry settings; the GUI interface cannot work with UNC.

Friday, May 29, 2009

Bench Mark Logging

Bench marking also system administrators to easily see what changed between the time system went live and the current performance issue being experienced. So I was setting this up on the new SQL 2005 Cluster running on Windows 2008; I have setup following counters. Everyone's needs will be different; if there are counters anyone think I should consider adding please email me at gupta.mohit.k@gmail.com or leave me comment here :) Thanks!

I am tracking following counters and my reasoning for it ...

System/OS Level:

  • Logical Disk, I know many people recommend physical disk just in case you are running multiple partitions on one disk. In my case each logical disk does not get shared with anything else so I am only tracking Logical performace.
  • Memory, I am running Active-Active configuration so I need to know how well that will work with memory (I got Min/Max, Lock pages in memory all configured).
  • Page file, Again testing memory pressure.
  • Network, Network I/O
  • Processor, Processor % Time, Queue Length etc.
SQL Level:
  • Buffer Manager, Memory pressure.
  • Database Statistics, Size, Log usage, # of transactions.
  • General Server Statistics, Application in question uses a lot of temp tables so I need to know when the usage of temp tables increases in case I have to adjust tempdb.
  • Latch & Lock Information, Blocking information, if any.
  • Transaction Information, Space in tempdb, batch requests, recomplies, plan cache, etc.
  • Wait Statistics, any waits taking place? If so what, network, disk, memory, or cpu?
Counter List:
\LogicalDisk(*)\% Disk Read Time
\LogicalDisk(*)\% Disk Time
\LogicalDisk(*)\% Disk Write Time
\LogicalDisk(*)\% Free Space
\LogicalDisk(*)\% Idle Time
\LogicalDisk(*)\Avg. Disk Bytes/Read
\LogicalDisk(*)\Avg. Disk Bytes/Write
\LogicalDisk(*)\Avg. Disk Read Queue Length
\LogicalDisk(*)\Avg. Disk sec/Read
\LogicalDisk(*)\Avg. Disk sec/Write
\LogicalDisk(*)\Avg. Disk Write Queue Length
\Memory\Available MBytes
\Memory\Page Faults/sec
\Memory\Pages/sec
\MSSQL$SQLInstanceName:Buffer Manager\Buffer cache hit ratio
\MSSQL$SQLInstanceName:Buffer Manager\Checkpoint pages/sec
\MSSQL$SQLInstanceName:Buffer Manager\Page life expectancy
\MSSQL$SQLInstanceName:Databases(*)\Active Transactions
\MSSQL$SQLInstanceName:Databases(*)\Backup/Restore Throughput/sec
\MSSQL$SQLInstanceName:Databases(*)\Data File(s) Size (KB)
\MSSQL$SQLInstanceName:Databases(*)\Log File(s) Size (KB)
\MSSQL$SQLInstanceName:Databases(*)\Log Flushes/sec
\MSSQL$SQLInstanceName:Databases(*)\Percent Log Used
\MSSQL$SQLInstanceName:Databases(*)\Transactions/sec
\MSSQL$SQLInstanceName:General Statistics\Active Temp Tables
\MSSQL$SQLInstanceName:General Statistics\Logins/sec
\MSSQL$SQLInstanceName:General Statistics\Transactions
\MSSQL$SQLInstanceName:Latches\Average Latch Wait Time (ms)
\MSSQL$SQLInstanceName:Latches\Latch Waits/sec
\MSSQL$SQLInstanceName:Latches\Total Latch Wait Time (ms)
\MSSQL$SQLInstanceName:Locks(*)\Average Wait Time (ms)
\MSSQL$SQLInstanceName:Locks(*)\Lock Requests/sec
\MSSQL$SQLInstanceName:Locks(*)\Lock Timeouts (timeout > 0)/sec
\MSSQL$SQLInstanceName:Locks(*)\Lock Waits/sec
\MSSQL$SQLInstanceName:Locks(*)\Number of Deadlocks/sec
\MSSQL$SQLInstanceName:Memory Manager\Connection Memory (KB)
\MSSQL$SQLInstanceName:Memory Manager\Target Server Memory (KB)
\MSSQL$SQLInstanceName:Memory Manager\Total Server Memory (KB)
\MSSQL$SQLInstanceName:Plan Cache(SQL Plans)\*
\MSSQL$SQLInstanceName:Plan Cache(Temporary Tables & Table Variables)\*
\MSSQL$SQLInstanceName:SQL Errors(*)\*
\MSSQL$SQLInstanceName:SQL Statistics\Batch Requests/sec
\MSSQL$SQLInstanceName:SQL Statistics\SQL Compilations/sec
\MSSQL$SQLInstanceName:SQL Statistics\SQL Re-Compilations/sec
\MSSQL$SQLInstanceName:Transactions\Free Space in tempdb (KB)
\MSSQL$SQLInstanceName:Transactions\Longest Transaction Running Time
\MSSQL$SQLInstanceName:Transactions\Transactions
\MSSQL$SQLInstanceName:Wait Statistics(*)\Lock waits
\MSSQL$SQLInstanceName:Wait Statistics(*)\Memory grant queue waits
\MSSQL$SQLInstanceName:Wait Statistics(*)\Network IO waits
\MSSQL$SQLInstanceName:Wait Statistics(*)\Page IO latch waits
\MSSQL$SQLInstanceName:Wait Statistics(*)\Page latch waits
\MSSQL$SQLInstanceName:Wait Statistics(*)\Wait for the worker
\Network Interface(*)\Bytes Received/sec
\Network Interface(*)\Bytes Sent/sec
\Network Interface(*)\Bytes Total/sec
\Network Interface(*)\Packets/sec
\Paging File(*)\*
\Processor(*)\% Privileged Time
\Processor(*)\% Processor Time
\Processor(*)\Interrupts/sec
\System\Context Switches/sec
\System\Processes
\System\Processor Queue Length
\System\Threads
\Terminal Services\*

SQL Server 2005 Full Text Search services failed to Start Up

Right after setting up SQL Server 2005 cluster on Windows 2008; the FTE Services did not start up automatically. Following error gets recorded in the event log/cluster events:

Cluster resource 'SQL Server Fulltext (InstanceName)' in clustered service or application 'VirtualClusterName' failed.

Generic application 'SQL Server Fulltext (InstanceName)' could not be brought online (with error '1075') during an attempt to start the service. Possible cause: the specified service parameters might be invalid.


Microsoft Engineer pointed me to KB936302, Problem #3; however said solution of install SP2 was not enough. The services still refused to come online, to make it so services started successfully we edited following registry entry:

Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSFESQL$InstanceName\
Multi-String Value: DependsOnService
Value Changed from ...

NTLMSSP
RPCSS

to

RPCSS

Rebooted each node; services came on successfully.

About Me

My Photo
Mohit K. Gupta
I have a B.Sc. Computer Science with Minor in Japanese. I am currently working on my Masters in Information Systems. I spend most of my time on studies for research or for SQL Server related issues. My main interests revolve around Japan, to that regards, I practice Kendo (The Way of the Sword) and have been doing it for now about 5 Years. In addition I like to play Go (or Igo in Japanese) board game like chess, but maybe a bit more complicated. I am hoping to go to Japan in the future to further develop my Japanese, Go and Kendo skills.
View my complete profile

Certification & Degrees

  • MCITP: Database Administrator
  • MCTS: SQL Server 2005
  • B.Sc. Computer Science
  • Japanese Minor