Posts Tagged 'Kernel'

I’ll be at WCRE 2009 presenting NTrace

Next week, the 16th Working Conference on Reverse Engineering (WCRE) will be held in Lille, France. I will be there presenting NTrace: Function Boundary Tracing for Windows on IA-32.

NTrace is a dynamic function boundary tracing toolkit for IA-32/x86 that can be used to trace both kernel and user mode Windows components — examples for components that can be traced include the kernel itself (ntoskrnl), drivers like NTFS as well as user mode components such as kernel32, shell32 or even explorer.exe.

NTrace implements a novel approach to instrumenting IA-32 machine code and integrating with the Structured Exception Handling facility of Windows. Using this approach, NTrace is not only capable of tracing nearly the entire Windows kernel and system libraries, it is also faster than Solaris DTrace FBT on IA-32!

Details on how exactly NTrace works will be published in the paper, which will be made available soon. I will also publish more details on NTrace both here and on a dedicated NTrace website.

The work, by the way, is basically the result of the Master’s thesis I wrote back in 2008.


cfix 1.2 introduces improved C++ support

cfix 1.2, which has been released today, introduces a number of new features, the most prominent being improved support for C++ and additional execution options.

New C++ API

To date, cfix has primarily focussed on C as the programming language to write unit tests in. Although C++ has always been supported, cfix has not made use of the additional capabilities C++ provides. With version 1.2, cfix makes C++ a first class citizen and introduces an additional API that leverages the benefits of C++ and allows writing test cases in a more convenient manner.

Being implemented on top of the existing C API, the C++ API is not a replacement, but rather an addition to the existing API set.

As the following example suggests, fixtures can now be written as classes, with test cases being implemented as methods:

#include <cfixcc.h>

class ExampleTest : public cfixcc::TestFixture
{
public:
  void TestOne() {}
  void TestTwo() {}
};


To learn more about the definition of fixtures, have a look at the respective TestFixture chapter in the cfix documentation.

Regarding the implementation of test cases, cfix adds a new set of type-safe, template-driven assertions that, for instance, allow convenient equality checks:

void TestOne()
{
  const wchar_t* testString = L"test";
  // Use typesafe assertions...
  CFIXCC_ASSERT_EQUALS( L"test", testString );
  CFIXCC_ASSERT_EQUALS( wcslen( testString ), ( size_t ) 4 );
  // ...log messages...
  CFIX_LOG( L"Test string is %s", testString );
  // ...or use the existing "C" assertions.
  CFIX_ASSERT( wcslen( testString ) == 4 );
  CFIX_ASSERT_MESSAGE( testString[ 0 ] == 't',
    L"Test string should start with a 't'" );
}

Again, have a look at the updated API reference for an overview of the new API additions.

Customizing Test Runs

Another important new feature is the addition of the new switches -fsf (Shortcut Fixture), -fsr (Shortcut Run), and -fss (Shortcut Run On Failing Setup). Using these switches allows you to specify how a test run should resume when a test case fails.

When a test case fails, the default behavior of cfix is to report the failure and resume at the next test case. By specifying -fsf, however, the remaining test cases of the same fixture will be skipped and execution resumes at the next fixture. With -fsr, cfix can be requested to abort the entire run as soon as a single test case fails.

What else is new in 1.2?


As always, cfix 1.2 is source and binary compatible with previous versions. The new MSI package and source code can now be downloaded from SourceForge.

cfix is open source and licensed under the GNU Lesser General Public License.

How GUI Thread Conversion on Svr03 Breaks the SEH Chain

The Windows kernel maintains two types of threads: Non-GUI threads and GUI threads. Non-GUI threads use the default stack size of 12 KB (on i386, which this discussion applies to) and the default System Service Descriptor Table (SSDT), KeServiceDescriptorTable. GUI threads, in contrast, are expected to have much larger stack requirements and thus use an extended stack size of 60 KB (Note: these are the numbers for Svr03 and may vary among releases). More importantly, however, GUI threads use a different SSDT: KeServiceDescriptorTableShadow. Unlike KeServiceDescriptorTable, which only supports the basic set of system calls, this SSDT also includes all the User and GDI system services.

All threads start off as Non-GUI threads. Once the application makes a call to a system service that does not fall within the default range, however, the NT kernel will suspect this thread to be about to do GUI stuff — and will convert the thread into a GUI thread.

Converting a thread to a GUI thread naturally has to entail two things — swapping the SSDT, and enlarging the stack. While swapping the SSDT is not really interesting, enlarging the stack size poses a challenge — you cannot really enlarge a stack as the nearby pages that would need to be acquired may not be available.

As a consequence, enlarging the stack works by swapping the stack. The old, small stack is exchanged for a newly allocated, larger stack. Now swapping a stack is not really a common thing to do and is pretty easy to get wrong. And well, as it turns out, the Svr03 kernel did in fact get it wrong.

But let’s start at the beginning.

When the number of the requested system service is found to be beyond the range supported by the default SSDT, KiConvertToGuiThread is called to perform the thread conversion. KiConvertToGuiThread itself is pretty dumb and lets PsConvertToGuiThread do the actual work.

The following pseudo code illustrates what PsConvertToGuiThread does:

NTSTATUS PsConvertToGuiThread()
{
  // Create the new stack.
  LargeStack = MmCreateKernelStack( ... )
  if ( LargeStack == NULL )
  {
    __try
    {
      // Allocation failed -- set last error value.
      NtCurrentTeb()->LastErrorValue = ERROR_NOT_ENOUGH_MEMORY;
    }
    __except( ... )
    {
    }

    // Bail out.
    return ...;
  }

  // N.B. We are still on the old stack.
  // This will copy the old stack's contents to the new stack and
  // migrate the context of the current thread to the new stack.
  SmallStack = KeSwitchKernelStack( LargeStack, ... );

  // Now we are on the new stack.
  MmDeleteKernelStack( SmallStack, ... );

  // Notify Win32k.
  ( PspW32ProcessCallout )( ... )
  ( PspW32ThreadCallout ) ( ... )
}

This code looks innocent enough, but in fact, it is lying. To see why, you have to recall how Structured Exception Handling is implemented on i386 and how the C compiler makes use of it (I think I have spent way too much time with SEH over the past months…): The __try/__except-block at the top of the routine will cause the compiler to emit the typical SEH prolog at the beginning of the function. The purpose of this prolog is to set up an EXCEPTION_REGISTRATION_RECORD and to put this record onto the current thread’s SEH chain, which in turn is rooted in the PCR. In the same way, the compiler will put an appropriate epilog at the end of the routine.

So while the code above suggests that the SEH stuff is scoped to the very beginning of the function, it will not be until the end of the function has been reached that the EXCEPTION_REGISTRATION_RECORD is torn down and removed from the SEH chain.

And at this point, it should become clear why this becomes a problem in the context of stack swapping. At the point where KeSwitchKernelStack is called, the EXCEPTION_REGISTRATION_RECORD will still be listed in the SEH chain, although it does not serve any particular purpose any more. So KeSwitchKernelStack is called, which will, as indicated before, copy the contents of the old stack to the new stack — which, of course, includes the EXCEPTION_REGISTRATION_RECORD.


However, neither KeSwitchKernelStack nor PsConvertToGuiThread updates the SEH pointer in the PCR! After the swapping has been conducted and MmDeleteKernelStack has returned, the root of the SEH chain will point to freed memory: memory where the EXCEPTION_REGISTRATION_RECORD once was.

Now two things are worth noting. First, PsConvertToGuiThread can be expected to occupy the bottommost stack frame of the kernel stack. A situation where the dangling pointer could harm a caller of PsConvertToGuiThread is thus not possible.

Secondly, PsConvertToGuiThread makes callouts to Win32k by invoking the callbacks pointed to by PspW32ProcessCallout and PspW32ThreadCallout. And in fact, it is only PsConvertToGuiThread’s luck that these routines are so well behaved that they do not cause the system to bugcheck because of the dangling pointer. If one of these routines (or routines called by these) did anything with the SEH chain going beyond adding another record to the chain and removing it later, odds are that this routine would dereference a stray pointer… and would bugcheck the system…

It is worth noting that the implementation of PsConvertToGuiThread has changed in Windows Vista, so that the above discussion does not apply to this and later releases.

cfix 1.1 introduces NT kernel mode unit tests

cfix 1.1 introduces a number of new features. The most important among these is the additional ability to write kernel mode unit tests, i.e. unit tests that are run in kernel mode. Needless to say, cfix 1.1 still supports user mode unit tests.

All contemporary unit testing frameworks focus on unit testing in user mode. Certainly, the vast majority of testing code can be assumed to be targeting user mode, so this does not come as a surprise. Tools for driver testing, of which there are quite a few, focus on integration testing: they usually test whether the driver works in its entirety.

While these tools are very useful indeed, they do not support true unit testing, i.e. offering the ability to test individual routines or subsystems of a driver. To perform such tests, it would be necessary to write a separate test driver or resort to other techniques such as this one.

cfix 1.1 fills in this gap and offers the ability to write kernel mode tests. That way, individual parts of what may eventually become a driver can thoroughly be tested in isolation, without necessitating much boilerplate code.


Writing a kernel mode unit test is as easy as writing a user mode unit test: the API is the same for user and kernel mode tests. Even the tools, cfix32 and cfix64, are the same for both modes. The only true difference is that kernel mode tests require slightly different build settings.

The following listing shows an example of a kernel mode unit test, but the same code could just as well be compiled into a user mode unit test.

#include <cfix.h>

static void FixtureSetup()
{
  CFIX_ASSERT( 0 != 1 );
}

static void FixtureTeardown()
{
  CFIX_LOG( L"Tearing down..." );
}

//
// Test routine -- do the actual testing.
//
static void Test1()
{
  ULONG a = 1;
  ULONG b = 1;
  CFIX_ASSERT( a + b == 2 );
  // You are free to use all WDM APIs here!
  CFIX_LOG( L"a=%d, b=%d", a, b );
}

//
// Define a test fixture.
//
CFIX_BEGIN_FIXTURE( MyFixture )
  CFIX_FIXTURE_SETUP( FixtureSetup )
  CFIX_FIXTURE_TEARDOWN( FixtureTeardown )
  CFIX_FIXTURE_ENTRY( Test1 )
CFIX_END_FIXTURE()

Once built, the test can be run from the command line:

C:\cfix\bin\i386>cfix32 -nologo -kern ktest.sys
Module: ktest (ktest.sys)
  Fixture: MyFixture

For a more detailed discussion and more example code, please refer to the tutorial.


For user mode code, the cfix architecture roughly looks like this:

The tests are compiled into a DLL. Using the testrunner application cfix32 or cfix64, one or more fixtures defined in the DLL can be run and the results are reported to the console or to a log file.

For kernel mode code, the architecture looks a little different. The tests are compiled into a driver rather than into a DLL. The driver is very lightweight and, besides the tests, contains only very little cfix-provided code (basically, just a DriverEntry implementation).

When cfix32 or cfix64 is requested to run a kernel mode test, it will load the Reflector, a driver that contains the kernel mode fraction of the testing framework. By relaying control operations and output through the Reflector, the kernel mode unit tests can be run.

All these additional steps are performed without additional user intervention: the drivers are installed, loaded and stopped automatically. From a user perspective, running a kernel mode test feels just like running a user mode test.


cfix 1.1 introduces additional new features. I will discuss some of them over the next weeks. In any case, whether you have not used cfix yet or are a cfix 1.0 user, you should go straight to the download page now.

Reaching beyond the top of the stack — illegal or just bad style?

The stack pointer, esp on i386, denotes the top of the stack. All memory below the stack pointer (i.e. higher addresses) is occupied by parameters, variables and return addresses; memory above the stack pointer must be assumed to contain garbage.

When programming in assembly, it is equally easy to use memory below and above the stack pointer. Reading from or writing to addresses beyond the top of the stack is unusual and under normal circumstances, there is little reason to do so. There are, however, situations (rare situations) where it may be tempting to temporarily use memory beyond the top of the stack.

That said, the question is whether it is really just a convention and good style not to grab beyond the top of the stack or whether there are actually reasons why doing so could lead to problems.

When trying to answer this question, one first has to make a distinction between user mode and kernel mode. In user mode Windows, I am unable to come up with a single reason why usage of memory beyond the top of the stack could lead to problems. So in this case, it is probably merely bad style.

However, things are different in kernel mode.

In one particular routine I recently wrote, I encountered a situation where temporarily violating the rule of not reaching beyond the top of the stack came in handy. The routine worked fine for quite a while. In certain situations, however, it suddenly started to fail due to memory corruption. Interestingly enough, the routine did not fail always, but still rather frequently.

Having identified the specific routine as being the culprit, I started single stepping the code. Everything was fine until I reached the point where the memory above the stack pointer was used. The window spanned only a single instruction. Yet, as soon as I had stepped over the two instructions, the system crashed. I tried it multiple times, and it was perfectly reproducible when being single-stepped.

So I took a look at the stack contents after every single step I took. To my surprise, as soon as I reached the critical window, the contents of the memory location just beyond the current stack pointer suddenly became messed up. Very weird.

After scratching my head for a while, it suddenly started to make sense: I was not the only one using the stack. In between the two instructions, an interrupt must have occurred and been dispatched. As my thread happened to be the one currently running, it was my stack that was used for dispatching it. This also explains why it did not always happen unless I was single-stepping the respective code.

When an interrupt occurs and no privilege-level change has to be performed, the CPU will push the EFLAGS, CS and EIP registers on the stack. That is, the stack of whatever kernel thread happens to be the one currently running on this CPU is reused and the memory locations beyond the stack pointer will be overwritten by these three values. So what I initially interpreted as garbage, actually were the contents of EFLAGS, CS and EIP.

On Windows NT, unlike some other operating systems (FreeBSD, IIRC), handling the interrupt, which involves running the interrupt service routine (ISR), occurs on the same stack as well. The following stack trace, taken elsewhere, shows an ISR being executed on the stack of the interrupted thread:

f6bdab4c f99bf153 i8042prt!I8xQueueCurrentMouseInput+0x67
f6bdab78 80884289 i8042prt!I8042MouseInterruptService+0xa58
f6bdab78 f6dd501a nt!KiInterruptDispatch+0x49
f6bdac44 f6dd435f driver!Quux+0x11a 
f6bdac58 f6dd61db driver!Foobar+0x6f 

Moral of the story: Using memory beyond the current stack pointer is not only bad practice, it is actually illegal when done in kernel mode.

Ksplice — safe enough?

Last week, Ksplice, an automatic system for rebootless Linux kernel security updates, gained some attention. The idea of using hotpatching techniques for applying security fixes to the kernel in order to save reboots is not quite new. Not only does Windows support hotpatching as of Windows Server 2003 SP1, there also have been attempts to introduce a hot updating infrastructure to the Linux kernel before. Anyway, the paper is an interesting read.

The basic idea followed by Ksplice is to analyze the differences between an old (flawed) and a new (fixed) kernel binary. Based on this analysis, Ksplice decides which routines have changed and now need to be updated. Updating routines is performed by replacing the old routine, i.e. execution is redirected from the old to the new routine.

Such redirection requires code to be patched. Patching code is a nontrivial undertaking and always raises the question of safety; after all, careless kernel code patching could easily crash the entire system. The paper describes how this problem is dealt with, yet one of the paragraphs caught my attention (page 7):

A safe time to update a function is when no thread’s instruction pointer falls within that function’s text in memory and when no thread’s kernel stack contains a return address within that function’s text in memory.

Before inserting the trampolines, Ksplice captures all of the machine’s processors and checks whether the above safety condition is met for all of the functions being replaced. […]

So in order to ensure safety, Ksplice performs a full stack walk for all threads. While this is a sound approach in theory, it usually turns out to be rather problematic in practice. In fact, the only other updating/dynamic instrumentation approach I am currently aware of that also performs stack walks is Paradyn; all other approaches have (deliberately) chosen other ways to perform safe runtime code modifications.

The reason why stack walking is problematic should be obvious — creating perfect stack traces either requires proper debugging information for all modules involved to be available or requires proper stack frames for all routines so that the ebp-chain can be traversed. In practice, debugging information is often not available for all modules. While this is probably less a problem on Linux than on Windows, it is still a problem that cannot be easily dismissed. Finally, optimizations such as Frame Pointer Omission can thwart attempts to perform a stack walk by following the ebp-chain.

The paper is not specific on how exactly these stack walks are performed and how it tries to overcome these problems, so I took a look at the sources. The stack walk is performed by the routine check_stack, which is shown in the following listing (Excerpt from primary.c, lines 283–307):

/* Modified version of Linux's print_context_stack */
int check_stack(struct thread_info *tinfo, long *stack)
{
  int conflict, status = 0;
  long addr;

  while (valid_stack_ptr(tinfo, stack)) {
    addr = *stack++;
    if (__kernel_text_address(addr)) {
      conflict = check_address_for_conflict(addr);
      if (conflict)
        status = -1;
      if (debug >= 2) {
        printk("%08lx ", addr);
        if (conflict)
          printk("[<-CONFLICT] ");
      }
    }
  }
  if (debug >= 2)
    printk("\n");

  return status;
}

The parameter stack contains the frame pointer of the topmost stack frame. Starting from this address, the routine treats every doubleword on the stack as a potential return address and checks whether it points into one of the critical functions. While this approach may see too many stack frames and generate false positives, it is in fact a more pessimistic and thus, in this context, safer approach than walking the stack by following the ebp-chain.

AuxKlibGetImageExportDirectory and forwarders

One of the newer additions to the DDK is the aux_klib library, which, among other things, offers the routine AuxKlibGetImageExportDirectory. As its name suggests, AuxKlibGetImageExportDirectory offers a handy way to obtain a pointer to the export directory of a kernel module.

There is, however, one issue that, at least in my opinion, renders AuxKlibGetImageExportDirectory pretty much useless in most scenarios: dealing with forwarders.

The primary motivation to call AuxKlibGetImageExportDirectory is to either enumerate the exports of a module or to find a specific export. In both cases, the code is likely to call at least one of the exported routines. To maintain binary compatibility, it would be risky for such code to rely on the fact that all exports that it aims to call are in fact ‘real’ exports and not forwarders. Rather, it is crucial to be prepared to find both types, exports and forwarders, in the export directory and handle each of them appropriately.

So we need to tell an export from a forwarder. As it turns out, this is not quite as easy as checking some flag. Quoting the Microsoft Portable Executable and Common Object File Format Specification on the content of the export address table:

Each entry in the export address table is a field that uses one of two formats in the following table. If the address specified is not within the export section (as defined by the address and length that are indicated in the optional header), the field is an export RVA, which is an actual address in code or data. Otherwise, the field is a forwarder RVA, which names a symbol in another DLL.

And this exactly is the problem — only being provided the PIMAGE_EXPORT_DIRECTORY pointer, we do not know the start and end RVA of the export section. As a consequence, identifying forwarders is infeasible when using AuxKlibGetImageExportDirectory — which in turn makes it a pretty much useless function.


Although AuxKlibGetImageExportDirectory is handy, the work it performs is rather trivial. Therefore, it is not hard to come up with code that, given the load address of a module, finds the export directory and properly checks for the existence of forwarders. The following code shows how:


// LoadAddress is the load address of the module (assumed to be given).
// PtrFromRva( base, rva ) adds an RVA to a base address.
PIMAGE_DOS_HEADER DosHeader = ( PIMAGE_DOS_HEADER ) LoadAddress;
PIMAGE_NT_HEADERS NtHeader;
PIMAGE_DATA_DIRECTORY ExportDataDir;
PIMAGE_EXPORT_DIRECTORY ExportDirectory;
PULONG FunctionRvaArray;
PUSHORT OrdinalsArray;

ULONG Index;

// Peek into PE image to obtain exports.
NtHeader = ( PIMAGE_NT_HEADERS )
  PtrFromRva( DosHeader, DosHeader->e_lfanew );
if( IMAGE_NT_SIGNATURE != NtHeader->Signature )
{
  // Unrecognized image format.
  return ...;
}

ExportDataDir = &NtHeader->OptionalHeader.DataDirectory
  [ IMAGE_DIRECTORY_ENTRY_EXPORT ];

ExportDirectory = ( PIMAGE_EXPORT_DIRECTORY ) PtrFromRva(
  DosHeader,
  ExportDataDir->VirtualAddress );

if ( ExportDirectory->AddressOfNames == 0 ||
   ExportDirectory->AddressOfFunctions == 0 ||
   ExportDirectory->AddressOfNameOrdinals == 0 )
{
  // This module does not have any exports.
  return ...;
}

FunctionRvaArray = ( PULONG ) PtrFromRva(
  DosHeader,
  ExportDirectory->AddressOfFunctions );

OrdinalsArray = ( PUSHORT ) PtrFromRva(
  DosHeader,
  ExportDirectory->AddressOfNameOrdinals );

for ( Index = 0; Index <
      ExportDirectory->NumberOfNames; Index++ )
{
  // Get corresponding export ordinal.
  USHORT Ordinal = ( USHORT ) OrdinalsArray[ Index ]
    + ( USHORT ) ExportDirectory->Base;

  // Get corresponding function RVA.
  ULONG FuncRva =
    FunctionRvaArray[ Ordinal - ExportDirectory->Base ];

  if ( FuncRva >= ExportDataDir->VirtualAddress &&
     FuncRva < ExportDataDir->VirtualAddress
       + ExportDataDir->Size )
  {
    // It is a forwarder.
  }
  else
  {
    // It is an export.
  }
}


About me

Johannes Passing lives in Berlin, Germany and works as a Solutions Architect at Google Cloud.

While mostly focusing on Cloud-related stuff these days, Johannes still enjoys the occasional dose of Win32, COM, and NT kernel mode development.

He also is the author of cfix, a C/C++ unit testing framework for Win32 and NT kernel mode, Visual Assert, a Visual Studio Unit Testing-AddIn, and NTrace, a dynamic function boundary tracing toolkit for Windows NT/x86 kernel/user mode code.

Contact Johannes: jpassing (at) hotmail com
