Debugging the MaxDB Kernel on Windows – How to get a Call stack from hanging tasks

Ok, today I’ll show you how to install, setup and use the Windows Debugger to get information about hanging MaxDB kernel tasks.
The first thing we need is obviously a hanging MaxDB session, so let’s produce one!

Setting up the test case

My MaxDB instance is a 7.6.03 Build 9 instance with only one 50 MB data volume. The goal of the test setup is to create a DB FULL situation, since this is a rather common MaxDB hanging situation when DATA AREA AUTOEXTEND is not active.
Therefore I create a simple table in SQL Studio:

create table test (col1 int primary key)

and fill it with LOTS of data (notice the Cartesian product of the three-way join with the files table):

insert into test (select rownum from files a, files b, files c)

After a short amount of time my DB will be filled up. This is something the DBMGUI really tells you:

For our analysis of this situation we need some more information, especially we need the Windows thread ID of our hanging session. This information is available via the x_cons tool (you may as well use the dbmcli-command ‘db_cons’ or the DBMGUI function CHECK -> Database Server -> ACTIVE:

x_cons NDB show active:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item 
T17    7  0x4EC User       3004 DB FULL  (197)        0 0               791(s)
T33    4  0xD1C Savepnt         Vsleep                0 0               262(s)

As we see only two tasks are currently ACTIVE. One (T33) is the SAVEPOINT that is triggered automatically whenever a LOG FULL/DB FULL situation occurs.
The second one (T17) is our hanging session and we see it’s waiting on the DB FULL (197) suspend reason. Note the ‘Win tid’ column; it’s what we’re looking for right now. Our user task has the Windows thread ID 0x4EC in Hex-Format or 1260 in decimal format.

Installing the Windows Debugger

Now we’ve two options to get a call stack. The first one is to use the MS Windows Debugger. This tool is available on the Microsoft Website for free download.

Debugging Tools and Symbols: Getting Started (http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx)

Depending on your Platform either choose “Install Debugging Tools for Windows 32-bit Version” or “Install Debugging Tools for Windows 64-bit Versions” and install the tools on the machine where our database runs.

I installed mine into the standard path ‘C:\\Program Files\\Debugging Tools for Windows’.

The next things we need are the symbol files.
These file contain the information that make it possible that the debugger can tell us, what function is at what memory address of any windows program.
These symbols can either be downloaded completely via the ‘Download Windows Symbol Packages’ or we can tell the debugger to load them on-demand while we debug. As my machine is connected to the Internet.
I go for this option and show you how to configure in the next few paragraphs.

Debugging our user task

Now, we just start the WinDbg.exe (after the installation there is a new entry in the START-menu).
In order to analyze anything, we’ve to tell the Debugger where to find all the symbols. Via the Menu  ‘File’-> ‘Symbol File Path’ we get a dialogue like this:

We’ve to enter a PATH Variable here, so we can put in several paths limited by the semicolon sign ‘;’. Since we want to have the Windows symbols automatically downloaded from the MS website the first path we enter is this one:

SRV*c:\\websymbols*http://msdl.microsoft.com/download/symbols;

The bold printed part denotes the local folder where the debugger should store the downloaded symbol files. Next, we need to supply the paths to the MaxDB symbols. Since these are delivered by default in each installation I just have to put in these paths on my machine:

C:\\sapdb\\nbd\\db\\symbols;C:\\sapdb\\programs\\symbols

As you see there is a ‘symbols’-Folder in each DEPENDENT folder and one in the INDEPENDENT-folder.

Now the debugging can start.
To do so we need to attach the debugger to the process we want to debug. This can be done via ‘File’ -> ‘Attach to process [F6]‘.
We get a list of running processes and choose our ‘KERNEL.EXE’:

Make sure you check the box ‘Noninvasive’ since we want our process to run further after our debugging and all we want to do now is ‘just looking’!

If you’re unsure which one of the possibly several KERNEL.EXE-processes on your machine would be the right one, just look up the process ID in the knldiag-File in the startup part:

INFO     Starting SERVERDB:      'NDB'
INFO              SERVERNODE:    'xxxxxxxxxnotvisibletopublicxxxxxx'
INFO              Process ID:    3028                         
INFO              Date:          2007-11-14
INFO              Owner:         'sdbUser'
INFO              Machine:       'Intel IA32 level 6 revision d06'
INFO              Processors:       1
INFO              Fiber:         'YES'
INFO              Max virtual memory:        2047 MB
INFO              Total physical memory:     1022 MB
INFO              Available physical memory: 419 MB

Ok, once the debugger is attached the rest is really easy.

We click on ‘Display Threads’ [1] first and choose a thread in the list. Now we need to remember our Windows thread ID from above: 0x4EC.
After we select our thread we click on ‘show Callstack’ [2] et voilà:

That’s the call stack of our hanging session.
To read it more easily or to hand it over to support for analysis we want a text-version of it. Nothing easier than that: right-click on the Windows-Title bar and click ‘Copy stack to clipboard’ in the context menu. Now paste it into notepad and we’re done.

To stop our debugging session we just select ‘Debug’ -> ‘DETACH DEBUGEE’ from the menu.

Process Explorer Approach

What we’ve right now can also be easily archived with the Sysinternals tool Process Explorer. All I can say about the Sysinternals tool suite for Windows is: get it onto your servers!
Whenever there is anything weird to analyze these tools have been reliable sources of in-depth system information.
So double recommendation: install them! Today!
The tools are also available for free at the MS website:

Windows Sysinternals (http://www.microsoft.com/technet/sysinternals/default.mspx)

Process Explorer http://www.microsoft.com/technet/sysinternals/Utilities/ProcessExplorer.mspx)

To be able to actually use symbols and produce readable call stacks we need to setup the process explorer a little bit.
It also needs to have the Windows Debugger tools installed, since it makes use of the special Dbghelp.dll that comes with them.
All we need to do now is to tell process explorer where this DLL is stored and enter our symbols path (see above):

Now we just browse through the process overview and double-click on our KERNEL.EXE.
The process ID from the knldiag-file can again help us here to find ‘our’ process:

Now we browse through the threads until we find thread ID 1260 (= Hex 0x4EC) and click on [Stack]:

Again we can use ‘copy’ to save this information to a text file or a support message.

Interpreting the call stack

The call stack we just created looks like this on my test machine:

00 ntdll!KiFastSystemCallRet
01 ntdll!ZwWaitForSingleObject+0xc
02 kernel32!WaitForSingleObjectEx+0xa8
03 kernel32!WaitForSingleObject+0x12
04 kernel!sql41c_wait_event_sem+0x11
05 kernel!__os74WaitForWakeup+0xa4
06 kernel!RTETask_TaskScheduler::NextScheduledTask+0x19f
07 kernel!os74GotoDispatcher+0x1f
08 kernel!vsuspend+0x153
09 kernel!Kernel_SynchronizedTaskWaitList::Wait+0x12
0a kernel!Converter_Converter::RequestNewPageNo+0x18a
0b kernel!Converter_Converter::RequestNewPermPageNo+0x22
0c kernel!b13new_node+0x92
0d kernel!Data_PageAccessManager::NewPage+0x1e1
0e kernel!Data_ChainPage::Append+0xbd
0f kernel!Data_ChainSplitSpaceBackwardRead<Log_UndoPage>::ReserveSpace+0x761
10 kernel!Log_UndoFile::WriteUndoEntry+0x74
11 kernel!Log_Transaction::WriteUndoEntry+0x1d2
12 kernel!Log_ReadWriteActionImage::WriteBeforeImage+0x9d
13 kernel!kb611ins_WriteBeforeImage+0x16
14 kernel!k61bd_ins+0xde
15 kernel!bd30AddToTreeWithBeforeImage+0x153
16 kernel!b02kb_ins_record+0x17e
17 kernel!kb61insert_rec+0xa7
18 kernel!k61ins_select+0x13a
19 kernel!kb721result_handle+0x271
1a kernel!kb721select_gets+0x14f
1b kernel!kb721seq_search+0x2b8
1c kernel!k721strat_distribution+0x177
1d kernel!k720_select+0x402
1e kernel!k720_single_select+0x24
1f kernel!k61table_ins_del_upd+0x34
20 kernel!k05functions+0x60c
21 kernel!a06lsend_mess_buf+0x349
22 kernel!a06dml_send_mess_buf+0x43
23 kernel!a505most_execute+0xbf4
24 kernel!a505loop_most_execute+0x332
25 kernel!a501exec_with_change_rec+0x3a3
26 kernel!a502complex_execution+0x407
27 kernel!a501exec_with_change_rec+0x299
28 kernel!a501execute+0x2af
29 kernel!ak92analyze_messagetype+0x464
2a kernel!a92_mode_analyzer+0x97
2b kernel!ak93one_command+0x585
2c kernel!a93_user_commands+0x40e
2d kernel!SQLTask+0x7b
2e kernel!Kernel_Main+0x223
2f kernel!RTETask_TaskMain+0x4c
30 kernel!sql88k_kernel_fiber+0xd4
31 kernel32!BaseThreadStart+0x37

To ‘read’ the call stack we start from the button and move upwards (that’s how a stack works). I will just comment on a few of these lines now.
We see at line 2d that this is a user task ‘kernel!SQLTask+0x7b’ that currently handles a statement that changes the DB content ’27 kernel!a501exec_with_change_rec+0x299′.
Since the statement is not a single-table statement but involves a sub-select we see ’26 kernel!a502complex_execution+0x407′ before that.
A bit further up we see that the table-content will be changed 1f kernel!k61table_ins_del_upd+0x34′ and also we see that the select of the subquery is handled ‘1e kernel!k720_single_select+0x24’ and later it’s inserted into our target table ’17 kernel!kb61insert_rec+0xa7′.
An insert need, of course, some work on the b*tree in which the table is stored so we see ’15 kernel!bd30AddToTreeWithBeforeImage+0x153 and know by now that this transaction will be rollback-able.
After all Undo- and Log-Information has been written (yes, we follow the WAL Write-Ahead-Log approach) a new page needs to be allocated, to hold our data ‘0d kernel!Data_PageAccessManager::NewPage+0x1e1′.
As this page has to be provided by the Converter we find ‘0a kernel!Converter_Converter::RequestNewPageNo+0x18a’.
The Converter than realizes that there is no space available anymore, so it suspends our task and lets it wait for the notification that the DB FULL situation is gone ’05 kernel!__os74WaitForWakeup+0xa4‘.

As we see it’s not too difficult to understand what tasks are waiting for once we have the necessary information (the call stack).
This was just a demo situation but in other hanging situation this can
help to find the cause.
So better install the Windows Debugger and the Sysinternals tools (plus the Symbols if no Internet is available) on your Windows server today!

Cleaning up

But one last thing: how to solve our Data Full Situation?
In MaxDB you HAVE (sorry no more ways to stress this via formatting…) to add a data volume now. There’s no way to cancel or kill the user task (that is the user session) that causes the big data load even if it’s not wanted beforehand.
So what to do when we don’t really want the data or cannot provide much storage space right now?
Well, we kill the user task anyhow!

<Local>:NDB> db_cons kill T17
OK

Now, we’ll see the following with ‘show active’

<Local>:NDB> db_cons show active
OK
SERVERDB: NDB
ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait 
          tid   type        pid state          priority cnt try    item 
T17    7  0x4EC User       3004 DB FULL  (197) !      0 0               793(s)
T33    4  0xD1C Savepnt         Vsleep                0 0               2358(s)

The ‘!’ sign denotes that this user task has been marked to be killed.
Due to the cooperative multitasking model on user task level, the user task is not killed by some authority but has to commit ‘suicide’.
It first has to be able to read the KILL flag and react on this.
But to be able to do so, it must get out of the DB FULL situation.

So we really need to provide a new data volume here to get the task actually killed.
Therefore I add just a very small volume. Since the task will be killed the transaction that loads the data will be rolled back. In effect, the database will have much more free space again afterwards.

After the rollback is done I can simply drop the data volume again – this is the point where we are happy that this database is already on 7.6 😉
MaxDB Versions <7.5 are not able to drop volumes!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.