SQLServerWiki

“The Only Thing That Is Constant Is Change”

Database is still in the recovery. A bummer for DBA’s!

Posted by database-wiki on October 6, 2011

ISSUE:

=====

My Customer Database Michaels was still in the recovery for a long time. This happened when customer rebooted server after applying a windows patch.

Note:

This might happen even if there are no transactions to roll forward/back. The error log does not show any progress during recovery.

NOTE: If you are under this scenario i would suggest you to open a ticket with Microsoft PSS because identify the issue requires collecting memory dump’s and analysis it.

This issue can show up during one of the following activities:

1. Startup of database from restart of SQL Server or Backup/Restore or Attach/Detach or from auto close.

2. Log reader latency when db is used in transactional replication: The reader thread of the logreader process is responsible for scanning the transaction log to identify trying to process the transaction.

CAUSE:

======

About VLF:
=========

http://technet.microsoft.com/en-us/library/ms179355(SQL.90).aspx

The SQL Server Database Engine divides each physical log file internally into a number of virtual log files. Virtual log files have no fixed size, and there is no fixed number of virtual log files for a physical log file. The Database Engine chooses the size of the virtual log files dynamically while it is creating or extending log files. The Database Engine tries to maintain a small number of virtual files. The size of the virtual files after a log file has been extended is the sum of the size of the existing log and the size of the new file increment. The size or number of virtual log files cannot be configured or set by administrators.
The only time virtual log files affect system performance is if the log files are defined by small size and growth_increment values. If these log files grow to a large size because of many small increments, they will have lots of virtual log files. This can slow down database startup and also log backup and restore operations. We recommend that you assign log files a size value close to the final size required, and also have a relatively large growth_increment value.

This can happen if the log file has grown several times (usually as a result of very small autogrowth setting). As a result, the log will have several 1000 to several million VLFs. The SQLServerLogManager indicates that its calculating the number of Virtual log files before analyzing it. There seems to be Thousands of  virtual log files(usually as a result of very small autogrowth setting)

The first phase of recovering a database is called discovery where all the VLFs are scanned (in serial and single threaded fashion) and then inserted into a linked list in the sequence before actual recovery starts. Since this happens much before the analysis phase, there are no messages indicating the progress in the errorlog. Depending on the number of VLFs this initial discovery phase can take several hours even if there are no transactions in the log that need to be processed. This is BY DESIGN.

From our dump analysis it seems in our scenario we have over 8 millions VLF’s roughly.

EXTERNAL DOCUMENTATION:

========================

http://blogs.msdn.com/psssql/archive/2009/05/21/how-a-log-file-structure-can-affect-database-recovery-time.aspx

DATA COLLECTED:

===============

1. Error log from the time of recovery.

2. DBCC LOGINFO(dbname)–for
the database in question. This will have to be collected when db is online. (not
collected so far since database Michaels is still in recovery)

3. Couple of mini-dumps few seconds apart (if 1 and 2 do not provide a clear answer).

First Cause:

=========

1. To confirm that you are running into this problem, check the error log for the database in recovery. The last message you will see for that database(while in recovery) in the log:

2009-11-26 02:03:58.61
spid23s     Starting up database ‘Michaels’.

After which there is no status related to the recovery of ‘Michaels’

2. DBCC LOGINFO(dbname) — Since you will not be able to collect this during recovery, run this as soon as recovery finishes to give you an idea of how many VLFs the log file has.

Note: This query can take a long time to run depending on number of VLFs.

3. Mini-dumps taken during the discovery phase will show the following functions at the top of the stack where the linked list is created:

From dumps file SQLDump0004.mdmp that’s been collected by running dbcc stackdump. Recovery waits for the Michaels database as highlight below.

Call Stack

=======

sqlservr!SQLServerLogMgr::InsertFile

sqlservr!SQLServerLogMgr::AddNewLFCBs

sqlservr!SQLServerLogMgr::Init

sqlservr!RecoveryUnit::StartupPreRecovery

sqlservr!DBTABLE::Startup

sqlservr!DBMgr::StartupDB

sqlservr!DBMgr::StartDBsInParallel

sqlservr!DBMgr::ParallelDbStartupEntryPoint

sqlservr!SOS_Task::Param::Execute

sqlservr!SOS_Scheduler::RunTask

sqlservr!SOS_Scheduler::ProcessTasks

sqlservr!SchedulerManager::WorkerEntryPoint

sqlservr!SystemThread::RunWorker

sqlservr!SystemThreadDispatcher::ProcessWorker

sqlservr!SchedulerManager::ThreadEntryPoint

msvcr80!_callthreadstartex

msvcr80!_threadstartex

kernel32!BaseThreadStart

databaseName = wchar_t [129] “Michaels??”

physicalFileName = wchar_t [261]
“G:\Data\Michaels.mdf???”

0:034> .formats esi

Evaluate expression:

Hex:     00000000`007ae3dc

Decimal: 8053724 <—–
number of VLFs (Approximately 8 million)

Octal:  0000000000000036561734

Binary:  00000000 00000000 00000000 00000000 00000000 01111010 11100011 11011100

Chars:  …..z..

Time:    Sat Apr 04 10:38:44 1970

Float:   low 1.12857e-038 high 0

Double: 3.97907e-317

ACTION PLAN:

===========

We have to wait till the recovery completes. (we don’t
have a time frame for that).
If you have backup, restore it to
another server to retrieve business critical data from the database if you are worried only about the data.

Once the recovery has complete, please follow the below recommendations to avoid our current
situation.

1. Run DBCC SHRINKFILE to reduce the ldf(s) to a small size, thereby reducing the number of VLFs.

2. Run DBCC LOGINFO(dbname) and make sure the number of VLFs is < 500-1000. (Hint: The number of rows returned by the DBCC command correlates to number of VLFs). The FileSize column indicates size of each VLF.

3. Expand(resize) the log file to the desired size using a single growth operation. This can be achieved by setting a new Initial Size for the ldf(s) in the Database Properties->Files->Database files section.

4. Run DBCC LOGINFO(dbname) and verify that the number of VLFs is still a small number. NOTE: Step 3 may create multiple VLFs even if you specify a large size. However, each VLF will
be large.

5. Turn-off auto close. This can have adverse effects every time a database has to be started.

6. Backup the database.

Associated KB’s:

============

KB949523 The latency of a transactional replication is high in SQL Server 2005 when the value of the “Initial Size” property and the value of the Autogrowth property are small

KB907511 How to use the DBCC SHRINKFILE statement to shrink the transaction log file in SQL Server 2005

Second Cause:

===========

In case of the issue with logreader latency, you may see the following stack when the database is online.

check the error log for the database in recovery. The last message you will see for that database(while in recovery) in the log:

2011-02-13 21:09:54.28
spid24s     Recovery of database ‘RIGHTANGLE’ (5) is 52%
complete (approximately 38735 seconds remain). Phase 2 of 3. This is an
informational message only. No user action is required.

Call Stack

========

sqlservr!SQLServerLogMgr::FindFile

sqlservr!SQLServerLogIterForward::ComputeNextLSN

sqlservr!SQLServerLogIterForward::NextReadAheadLSN

sqlservr!SQLServerLogIterForward::TryLogReadAhead

sqlservr!SQLServerLogIterForward::GetNextBlock

sqlservr!SQLServerLogIterForward::GetNext

sqlservr!RecoveryMgr::RedoPass

sqlservr!RecoveryMgr::RecoverUpToRollback

sqlservr!RecoveryUnit::StartupRecovery

This is a slow recovery due to too many VLFs. FindFile seems to be the
bottleneck. Does a synchronous read from the vlf header in the .ldf file. One of
the main issues here is that SQLServerLogMgr::FindFile does a sequential scan of
the entire list of VLFs starting from the beginning for every log block. If they
have too many VLFs in the log file, looking up VLFs in the tail of the log will
need to scan through all entries. None of the released versions (Katmai/KJ) of
the product have fixed the SQLServerLogMgr::FindFile problem.

Third Cause:

==========

2011-03-07 17:53:38.21 spid15s
Recovery of database ‘rework_wlp’ (7) is 0% complete (approximately 28660
seconds remain). Phase 1 of 3. This is an informational message only. No user
action is required.

….

….

2011-03-07 18:41:45.91 spid15s
Recovery of database ‘rework_wlp’ (7) is 99% complete (approximately 1 seconds
remain). Phase 2 of 3. This is an informational message only. No user action is
required.

=> Recovery start time at 2011-03-07 17:53:38.21

=>  Last message printed by recovery process at 2011-03-07 18:41:45.91

From sysprocesses:

spid   kpid   blocked waittype waittime             lastwaittype                     waitresource                                                                                                                                                                                                                                                     dbid   uid    cpu         physical_io          memusage    login_time              last_batch              ecid   open_tran status                         sid                                                                                                                                                                            hostname                                                                                                                         program_name                                                                                                                     hostprocess cmd              nt_domain                                                                                                                        nt_username                                                                                                                      net_address  net_library  loginame                                                                                                                         context_info                                                                                                                                                                                                                                                       sql_handle                                 stmt_start  stmt_end    request_id
—— —— ——- ——– ——————– ——————————– —————————————————————————————————————————————————————————————————————————————————————- —— —— ———– ——————– ———– ———————– ———————– —— ——— —————————— —————————————————————————————————————————————————————————— ——————————————————————————————————————————– ——————————————————————————————————————————– ———– —————- ——————————————————————————————————————————– ——————————————————————————————————————————– ———— ———— ——————————————————————————————————————————– —————————————————————————————————————————————————————————————————————————————————————— —————————————— ———– ———– ———–
11     7652   0       0x0064   1                    SLEEP_BPOOL_FLUSH                                                                                                                                                                                                                                                                                 7      1      734343      108                  0           1900-01-01 00:00:00.000 1900-01-01 00:00:00.000 0      0         background                     0x0100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000                                                                                                                                                                                                                                                                               CHECKPOINT                                                                                                                                                                                                                                                                                                   sa                                                                                                                               0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0x0000000000000000000000000000000000000000 0           0           0
15     8628   0       0x0061   353                  IO_COMPLETION                                                                                                                                                                                                                                                                                     7      1      2352651     89513668             0           1900-01-01 00:00:00.000 1900-01-01 00:00:00.000 0      0         background                     0x0100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000                                                                                                                                                                                                                                                                               DB STARTUP                                                                                                                                                                                                                                                                                                   sa                                                                                                                               0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0x0000000000000000000000000000000000000000 0           0           0
54     6860   0       0x0000   0                    MISCELLANEOUS                                                                                                                                                                                                                                                                                     7      1      63          3                    2           2011-03-08 00:28:22.743 2011-03-08 01:39:17.743 0      0         runnable                       0x0100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 CHICRWAS001                                                                                                                      Microsoft SQL Server Management Studio – Query                                                                                   504         SELECT                                                                                                                                                                                                                                                                             84A48C26FC1D TCP/IP       sa                                                                                                                               0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0x010007008138DE11F0D8EE881000000000000000 0           -1          0
Spid 11 is the checkpointing spid and spid 15 is the recovery spid.

CALL STACK WILL BE:

sqlservr!SQLServerLogMgr::CheckLogBlockReadComplete
sqlservr!SQLServerLogIterBackLink::WaitforLogAvailable(void)
sqlservr!SQLServerLogIterBackLink::GetNextBlock(void)
sqlservr!SQLServerLogIterBackLink::GetNext(void)
sqlservr!XdesRMReadWrite::RollbackToLsn(class LSN * lastUndoLsn = 0x00000000`1361ee00, class LogIterBackLink * logScan = 0x00000000`1361ed80, class IndexErrorTable * indexErrors = 0x00000000`00000000, class AllocUnitId * noLatchAllocUnitId = 0x00000000`034c2c20)
sqlservr!RecoveryMgr::UndoRollback(void)
sqlservr!RecoveryMgr::RollbackRemaining(void)
sqlservr!RecoveryUnit::DoRollbackRecovery(void)
sqlservr!DBMgr::RunRollbackRecovery(class RollbackRecoveryPkt * rollbackPkt = 0x00000000`00000000)

The cause of the issue is not due to large number of VLF but a disk bottleneck.

ACTION PLAN:

  1. Install latest update for SQL Server 2005 or SQL Server 2008 to eliminate the know
    recovery issues. ( this will restart SQL Server 2008 instance but an option
    worth trying. )
  2. As recovery process is waiting on I/O complete we might be encountering a disk bottleneck, Check with your system engineer to see if the disk hold the database files are
    performing at optimal level. ( running perfmon is a good option )
  3. If step 1 and step 2 e implement. We need to wait till the database completes the recovery.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: