Runtime Code Modification Explained, Part 4: Keeping Execution Flow Intact

 

Concurrent Execution

A typical user mode process on a Windows system can be expected to have more than one thread. In addition to user threads, the Windows kernel employs a number of system threads. Given the presence of multiple threads, it is likely that whenever a code modification is performed, more than one thread is affected, i.e. more than one thread is sooner or later going to execute the modified code sequence.

The basic requirement that has to be met is that even in a preemptive, multi-threaded, multiprocessing environment, an instrumentation solution has to ensure that any other thread either does not run the affected code at all, runs code not yet reflecting any of the modifications, or runs code reflecting the entire set of modifications.

On a multiprocessor system, threads are subject to concurrent execution. While one thread is currently performing code modifications, another thread, running on a different processor, may concurrently execute the affected code.

If only a single instruction is to be modified and the cited algorithm for cross-modifying code is used, concurrent execution, preemption and interruption should not be of concern. Any other thread will either execute the old or new instruction, but never a mixture of both.

However, the situation is different when more than one instruction is to be modified. In this case, a different thread may execute partially modified code.

Although code analysis may indicate that certain threads never call the routine comprising the affected code, signals or Asynchronous Procedure Calls (APCs) executed on such a thread may. Therefore, a separation into affected and non-affected threads may not always be possible, and it is safer to assume that all threads are potentially affected.

Preemption and Interruption

Both on a multiprocessor and a uniprocessor system, all threads running in user mode as well as threads running in kernel mode at IRQL APC_LEVEL or below are subject to preemption. Similarly, a thread running at DISPATCH_LEVEL or Device IRQL (DIRQL) can still be interrupted by a higher-priority device interrupt. As these situations are similar, only the case of preemption is discussed.

If only a single instruction is to be modified, preemption and interruption may not be problematic. If, however, multiple instructions are to be adapted, the ramifications of preemption in this context are twofold. On the one hand, the code performing the modification may be preempted while being in the midst of a multi-step runtime code modification operation:

  • Thread A performs a runtime code modification. Before the last instruction has been fully modified, the thread is preempted. The instruction stream is now in a partially-modified state.
  • Thread B begins executing the code that has been modified by Thread A. In case instruction boundaries of old and new code match, the instruction sequence that is now run by Thread B should consist of valid instructions only, yet the mixture of old and new code may define unintended behavior. If instruction lengths do not match, the situation is worse. After Thread B has executed the last fully-modified instruction, the CPU will encounter a partially-overwritten instruction. Not being aware of this shift of instruction boundaries, the CPU will interpret the following bytes as instruction stream, which may or may not consist of valid instructions. As the code now executed has never been intended to be executed, the behavior of Thread B may now be considered arbitrary.

To prevent such a situation from occurring, an implementation can disable preemption and interruption by raising the IRQL above DIRQL during the modification process.

On the other hand, the code performing the code modification may run uninterrupted, yet one of the preempted threads might become affected:

  • Thread A has begun executing code being part of the instruction sequence that is about to be modified. Before having completed executing the last instruction of this sequence, it is preempted.
  • Thread B is scheduled for execution and performs the runtime code modification. Only after all instructions have been fully modified is it preempted.
  • Thread A is resumed. Two situations may now occur — either the memory pointed to by the program counter still defines the start of a new instruction or — due to instruction boundaries having moved — it points into the middle of an instruction. In the first case, a mixture of old and new code is executed. In the latter case, the memory is reinterpreted as instruction stream. In both cases, the thread is likely to exhibit unintended behavior.

One approach of handling such situations is to prevent them from occurring by adapting the scheduling subsystem of the kernel. However, supporting kernel preemption is a key characteristic of the Windows NT scheduler — removing the ability to preempt kernel threads thus hardly seems like an auspicious approach. Regarding the Linux kernel, however, it is worth noting that kernel preemption is in fact an optional feature supported on more recent versions (2.6.x) only. As a consequence, for older versions or kernels not using this option, the situation as described in the previous paragraph cannot occur.

A more lightweight approach to this problem relies on analysis of concurrently running as well as preempted threads. That is, the program counters of all threads are inspected for pointing to regions that are about to be affected by the code modification. If this is the case, the code modification is deemed unsafe and is aborted. Needless to say, it is crucial that all threads are either preempted or paused during this analysis as well as during the code modification itself. As the thread performing the checks and modifications is excluded from being paused and analyzed, it has to be assured that this thread itself is not in danger of interfering with the code modification.

In a similar manner, the return addresses of the stack frames of each stack can be inspected for pointing to such code regions. Stack walking, however, is exposed to a separate set of issues that I’ll discuss separately.

Rather than aborting the operation in case one of the threads is found to be negatively affected by the pending code modification, a related approach is to attempt to fix the situation. That is, the program counters of the affected threads are updated so that they can resume properly.

One example of a user-mode solution implementing this approach is Detours. Before conducting any code modification, Detours suspends all threads the user has specified as being potentially affected by the operation. After all code modifications have completed, the suspended threads are inspected and their program counters are adapted if necessary. Only then are the threads resumed.

Basic Block Boundaries

Another issue of multiple instruction modification is related to program flow. Whenever a sequence of instructions that is to be altered spans multiple basic blocks, it is possible that not only the first instruction of the sequence, but also one of the subsequent instructions may be a branch target. When instruction boundaries are not preserved by the code modification step, the branch target might fall in the midst of one of the new instructions. Again, such a situation is likely to lead to unintended program behavior.

Identifying basic blocks and thus any potential branch targets requires flow analysis. However, especially in the case of optimized builds, it is insufficient to perform an analysis of the affected routine only as blocks might be shared among more than one routine. In such cases, a routine does not consist of a contiguous region of code but may be scattered throughout the image. Therefore, it is crucial to perform flow analysis on the entire image. But even in this case, the existence of indirect branches may render a complete analysis impossible in practice.

Another situation where an instrumentation solution runs the danger of overwriting basic block boundaries is the instrumentation of very short routines. If the routine is shorter (in terms of instruction bytes occupied) than the instructions that need to be injected in order to instrument it, the first basic block(s) of the subsequent routine may be overwritten.

Runtime Code Modification Explained, Part 3: Cross-Modifying Code and Atomicity

Performing modifications on existing code is a technique commonly encountered among instrumentation solutions such as DTrace. Assuming a multiprocessor machine, altering code brings up the challenge of properly synchronizing such activity among processors.

As stated before, IA-32/Intel64 allows code to be modified in the same manner as data. Whether modifying data is an atomic operation or not depends on the size of the operand. If the total number of bytes to be modified does not exceed 8 and the target address adheres to certain alignment requirements, current IA-32 processors guarantee atomicity of the write operation.

If any of these requirements do not hold, multiple write instructions have to be performed, which is an inherently non-atomic process. What is often ignored, however, is that even in situations where using atomic writes or bus locking (i.e. using the lock prefix) on IA-32 or AMD64 would be feasible, such practice would not necessarily be safe as instruction fetches are allowed to pass locked instructions. Quoting 7.1.2.2 of the Intel manual:

Locked operations are atomic w.r.t. all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.

Although appealing, merely relying on the atomicity of store operations must therefore in many cases be assumed to be insufficient for ensuring safe operation.

The exact behavior in case of runtime code modifications also slightly varies among different CPU models. On the one hand, guarantees concerning the safety of such practices have, as indicated before, been lessened over the evolution from the Intel 486 series to the current Core 2 series. On the other hand, certain steppings of CPU models even exhibit defective behavior in this regard, as explained in several Intel errata, including this one for the Pentium III Xeon.

Due to this variance, the exact range of issues that can arise when performing code modifications is not clear, and appropriate countermeasures cannot be easily identified. As described in these errata, cross-modifying code not adhering to certain coding practices (described later) can lead to “unexpected execution behavior”, which may include the generation of exceptions.

The route chosen by the Intel documentation is thus to specify an algorithm that is guaranteed to work across all processor models — although for some processors, it might be more restricting than necessary.

For cross-modifying code, the suggested algorithm makes use of serializing instructions. The role of these instructions, cpuid being one of them, is to force any modifications to registers, memory and flags to be completed and to drain all buffered writes to memory before the next instruction is fetched and executed.

Quoting the algorithm defined in the Intel manual:


(* Action of Modifying Processor *)
Memory_Flag <- 0; (* Set Memory_Flag to value other than 1 *)
Store modified code (as data) into code segment;
Memory_Flag <- 1;

(* Action of Executing Processor *)
WHILE (Memory_Flag != 1)
Wait for code to update;
ELIHW;

Execute serializing instruction;
Begin executing modified code;

To further complicate matters, the IA-32 architecture, as you know, uses a variable-length instruction set. As a consequence, additional problems not yet addressed may occur if the instruction lengths of the old and new instruction do not match. Two situations may occur:

  1. The new instruction is longer than the old instruction. In this case, more than one instruction has to be modified. Modifications straddling instruction boundaries, however, are exposed to an extended set of issues that will be covered in my next post.
  2. The new instruction is shorter than the old instruction. The ramifications of this situation depend on the nature of the new instruction. If, for instance, the instruction is an unconditional branch instruction, the subsequent pad bytes will never be executed and can be neglected. If, on the other hand, execution may be resumed at the instruction following the new instruction, the pad bytes must constitute valid instructions. For this purpose, a sled consisting of nop instructions can be used to fill the pad bytes. The algorithm defined by Intel for cross-modifying code ensures that neither the old nor the new instruction is currently being executed while the modification is still in progress. Therefore, when employing this algorithm, replacing a single instruction by more than one instruction can be considered equally safe to replacing an instruction by an equally-sized instruction.

It is worth noting that regardless of which situation applies for instrumentation, the complementary situation will apply to uninstrumentation.

Runtime Code Modification Explained, Part 2: Cache Coherency Issues

Instrumentation of a routine may comprise multiple steps. As an example, a trampoline may need to be generated or updated, followed by a modification of the original routine, which may include updating or replacing a branch instruction to point to the trampoline.

In such cases, it is essential for maintaining consistency that the code changes take effect in a specific order. Otherwise, if the branch was written before the trampoline code has been stored, the branch would temporarily point to uninitialized memory. If multiple CPUs were involved and code became subject to execution while in such an inconsistent state, undefined execution behaviour would occur.

The order in which a program specifies memory loads and stores to be conducted is referred to as program order. On processors such as the Intel 386, this order is preserved. Contemporary processors, however, implement significantly weaker memory models. In order to speed up execution, these processors allow certain memory operations to be conducted out of order. Such reordering may, in certain situations like the one depicted before, lead to wrong results or to windows of inconsistency. To prevent such situations from occurring, the program must explicitly prohibit certain reorderings, which can be done by using memory fences.

Respecting the memory model implemented by the processor is thus crucial in order to achieve safe operation. Although both load and store operations are subject to potential reordering, only the reordering of store operations is of interest in the context of the example depicted above. However, the memory models and the memory ordering guarantees enforced by the various CPUs addressed in this chapter differ significantly.

IA-32 and Intel 64 implement a rather strong memory model. In particular, memory stores are always carried out in program order — this holds true for both uniprocessor and multiprocessor systems. For the situation depicted above, this means that the store of the updated branch target is not conducted before all stores writing the trampoline have completed.

SPARC V9 offers a choice between three memory models, which differ in the guarantees they provide: Total Store Order (TSO), Partial Store Order (PSO), and Relaxed Memory Order (RMO). TSO, the strongest of the three, guarantees preservation of the order of store operations; as such, no memory fences are required. Both RMO, the weakest memory model, and PSO do not provide such guarantees. That is, to ensure that the second store is not carried out before the first store has completed, an appropriate memory fence, i.e. a MEMBAR #StoreStore instruction, has to be executed between the two stores.

The memory model implemented by IA-64 also allows stores to be conducted out of order. This can be prevented either by a memory fence or by giving the second store release semantics: using the st.rel instruction rather than st indicates to the processor that this instruction must not take effect until all prior orderable instructions, which include the first store, have taken effect.

Instruction Cache/Data Store Incoherencies

As mentioned before, many modern microprocessors, including contemporary IA-32, SPARC and IA-64 CPUs, use dedicated instruction caches. Whether these instruction caches are kept coherent with data caches depends on the architecture. Again, IA-32 and Intel 64 are more forgiving than other CPUs in this regard and keep instruction and data caches coherent — no manual intervention for flushing the instruction cache is required (as a consequence, NtFlushInstructionCache is essentially a no-op on these architectures).

SPARC does not maintain this coherency automatically. To have instruction changes take effect immediately, SPARC requires the developer to issue a FLUSH instruction for each modified machine word of instructions. In a similar manner, a fc.i instruction is required on IA-64 to flush the respective instructions from the instruction cache.

Pipeline/Instruction Cache Incoherencies

Another issue that may occur when writing self- or cross-modifying code is that the processor’s pipeline may become incoherent with the instruction cache. That is, although the instruction cache contains updated instructions, the CPU may continue working with outdated instructions for a while.

According to the SPARC processor manual, SPARC is not exposed to this problem; flushing the instruction cache is sufficient. On IA-64, however, this incoherency can occur — whether it is in fact a problem depends on the individual usage scenario. To force the instruction flush to take effect immediately and to synchronize the instruction cache with the instruction fetch stream, issuing a sync.i instruction is required.

On IA-32, the exact behavior in case of runtime code modification in general and such incoherencies in particular slightly varies among different models. On the one hand, guarantees concerning the safety of such practices have been lessened over the evolution from the Intel 486 series to the current Core 2 series. On the other hand, certain steppings of CPU models have been explicitly documented to exhibit defective behavior in this regard. Although detailed technical information on this topic is not available, these problems seem to emerge when code that is currently being executed is modified.

Due to this variance, the exact range of issues that can arise due to runtime code modification and cache incoherencies is not clear, and appropriate countermeasures cannot be easily identified. The route chosen by the Intel documentation is thus to specify an algorithm that is guaranteed to work across all processor models — although for some processors, it might be more restricting than necessary.

For cross-modifying code, the suggested algorithm makes use of serializing instructions. The role of these instructions, cpuid being among them, is to force any modifications to registers, memory and flags to be completed and to drain all buffered writes to memory before the next instruction is fetched and executed.

It is worth pointing out that Intel 64 does not seem to be exposed to this issue. Moreover, as the AMD documents state:

Synchronization for cross-modifying code is not required for code that resides within the naturally aligned quadword.

Local/Remote Incoherencies

Finally, there may be discrepancies between the code the local processor sees and the code other processors on an SMP system see. That is, although the stores may have already taken effect on the local CPU, they may be delayed on other CPUs, so that these CPUs may continue working with the old instructions for some amount of time.

In many cases, as long as ordering is preserved, such a delay is not a major problem. However, when the changes must take effect immediately, further steps are required.

Intel 64 specifies that

Stores from a processor appear to be committed to the memory system in program order; however, stores can be delayed arbitrarily by store buffering while the processor continues operation.

As a consequence, an MFENCE instruction should be executed as soon as the code patch has been written. In a similar way, IA-64 requires an mf (memory fence) instruction to be issued after the fc.i and sync.i if changes are to take effect immediately on remote CPUs.

Runtime Code Modification Explained, Part 1: Dealing With Memory

Runtime code modification, or self-modifying code as it is often referred to, has been used for decades — to implement JITters, write highly optimized algorithms, or do all kinds of interesting stuff. Using runtime code modification has never been really easy — it requires a solid understanding of machine code and it is easy to screw up. What is not so well known, however, is that writing such code has actually become harder over the last years, at least on the IA-32 platform: comparing the 486 and current Core architectures, it becomes obvious that Intel, in order to allow more advanced CPU-internal optimizations, has lessened certain guarantees made by the CPU, which in turn requires the programmer to pay more attention to certain details.

Looking around on the web, there are plenty of code snippets and example projects that make use of self-modifying code. Without finger-pointing specific resources, it is, however, safe to assume that a significant (and I mean significant!) fraction of these examples fail to address all potential problems related to runtime code modification. As I have shown a while ago, even Detours, a well-done and widely recognized library relying on runtime code modification, has its issues.

Adopting the nomenclature suggested by the Intel processor manuals, code writing data to memory with the intent of having the same processor execute this data as code is referred to as self-modifying code. On SMP machines, it is possible for one processor to write data to memory with the intent of having a different processor execute this data as code. This process is referred to as cross-modifying code. I will jointly refer to both practices as runtime code modification.

Memory Model

The easiest part of runtime code modification is dealing with the memory model. In order to implement self-modifying or cross-modifying code, a program must be able to address the regions of memory containing the code to be modified. Moreover, due to memory protection mechanisms, overwriting code may not be trivially possible.

The IA-32 architecture offers three memory models — the flat, segmented and real mode memory model. Current operating systems like Windows and Linux rely on the flat memory model, so I will ignore the other two.

Whenever the CPU fetches code, it addresses memory relative to the segment mapped by the CS segment register. In the flat memory model, the CS segment register, which refers to the current code segment, is always set up to map to linear address 0. In the same manner, the data and stack segment registers (DS, SS) are set up to refer to linear address 0.

It is worth mentioning that AMD64 has retired the use of segmentation and the segment bases for code and data segment are therefore always treated as 0.

Given this setup, code can be accessed and modified on IA-32 as well as on AMD64 in the same manner as data. Easy-peasy.

Memory Protection

One of the features enabled by the use of paging is the ability to enforce memory protection. Each page can specify restrictions on which operations may be performed on memory of the respective page.

In the context of runtime code modification, memory protection is of special importance as memory containing code usually does not permit write access, but rather read and execute access only. A prospective solution thus has to provide a means to either circumvent such write protection or to temporarily grant write access to the required memory areas.

As other parts of the image are write-protected as well, memory protection equally applies to approaches that modify non-code parts of the image such as the Import Address Table. That’s why the call to VirtualProtect is necessary when patching the IAT. Programs using runtime code modification often do not restrict themselves to changing existing code but rather generate additional code. Assuming Data Execution Prevention has been enabled, it is thus vital for such approaches that any code generated is placed into memory regions that grant execute access. While user-mode implementations can rely on a feature of the RTL heap (i.e. passing HEAP_CREATE_ENABLE_EXECUTE to RtlCreateHeap) for allocating executable memory, no comparable facility exists for kernel mode — a potential instrumentation solution thus has to come up with a custom allocation strategy.

Jump Distances

Whenever code is being generated, odds are that branch instructions are involved. Depending on where memory for the new code has been allocated and where the branch target falls, the offset between the branch instruction itself and the jump target may be of significant size. In such cases, the software has to make sure that the chosen branch instruction does in fact support offsets at least as large as required for the individual purpose. This sounds trivial, but it is not: software that overwrites existing code with a branch may face severe limitations w.r.t. how many bytes the branch instruction may occupy — if, for example, there are fewer than 5 bytes of space (assuming IA-32), a jump with a 32-bit displacement cannot be used. To use a 2-byte short jump instead, however, the newly allocated code better be near.

Further safety concerns will be discussed in Part 2 of this series of posts.

Windows Hotpatching: A Walkthrough

As discussed in the last post, Windows 2003 SP1 introduced a technology known as Hotpatching. An integral part of this technology is hotpatching itself, which refers to the process of applying an update on the fly by using runtime code modification techniques.

Although Hotpatching has caught a bit of attention, surprisingly little information has been published about its inner workings. As the technology is patented, however, there is quite a bit of information that can be obtained by reading the patent description. Moreover, there is this (admittedly very terse) discussion about the actual implementation of hotpatching.

Armed with this information, it is possible to get into more detail by looking at what actually happens under the hood when a hotfix is applied. I did so and chose KB911897 as an example, which fixes some flaw in mrxsmb.sys and rdbss.sys. I have also gone through the hassle of translating key parts of the respective assembly code back to C.

Preparing the machine

First, we need a proper machine image which can be used for the experiment. Unfortunately, KB911897 is an SP1 package, so we have to use an old Win 2003 Server SP1 system to apply this update. Once we have the machine running, we can attach the kernel debugger and see what is happening when the hotfix is installed.

Observing the update

When launched with /hotpatch:enable, after some initialization work, the updater calls NtSetSystemInformation (which delegates to ExApplyCodePatch) to apply the hotpatch. Hotpatching includes a coldpatch, which I do not care about here and the actual hotpatch. The first two calls to NtSetSystemInformation (and thus to ExApplyCodePatch) are coldpatching-related and I will thus ignore them here. The third call, however, is made to apply the actual hotpatch, so let’s observe this one further.

Requiring a kernel mode-patch, ExApplyCodePatch then calls MmHotPatchRoutine, which is where the fun starts. Expressed in C, MmHotPatchRoutine roughly looks like this (reverse engineered from assembly, might be slightly incorrect):

NTSTATUS MmHotPatchRoutine(
  __in PSYSTEM_HOTPATCH_CODE_INFORMATION RemoteInfo
  )
{
  UNICODE_STRING ImageFileName;
  DWORD Flags = RemoteInfo->Flags;
  PVOID ImageBaseAddress;
  PVOID ImageHandle;
  NTSTATUS Status, LoadStatus;
  PKTHREAD CurrentThread;

  ImageFileName.Length = RemoteInfo->KernelInfo.NameLength;
  ImageFileName.MaximumLength = RemoteInfo->KernelInfo.NameLength;
  ImageFileName.Buffer = ( PBYTE ) RemoteInfo + RemoteInfo->KernelInfo.NameOffset;

  CurrentThread = KeGetCurrentThread();
  KeEnterCriticalRegion( CurrentThread );

  KeWaitForSingleObject(
    &MmSystemLoadLock,
    WrVirtualMemory,
    0,
    0,
    0 );

  LoadStatus = MmLoadSystemImage(
    &ImageFileName,
    0,
    0,
    0,
    &ImageHandle,
    &ImageBaseAddress );
  if ( NT_SUCCESS( LoadStatus ) || LoadStatus == STATUS_IMAGE_ALREADY_LOADED )
  {

    Status = MiPerformHotPatch(
      ImageHandle,
      ImageBaseAddress,
      Flags );
    
    if ( NT_SUCCESS( Status ) || LoadStatus == STATUS_IMAGE_ALREADY_LOADED )
    {
      NOTHING;
    }
    else
    {
      MmUnloadSystemImage( ImageHandle );
    }
    
    LoadStatus = Status;
  }


  KeReleaseMutant(
    &MmSystemLoadLock,
    1,  // increment
    FALSE,
    FALSE );

  KeLeaveCriticalRegion( CurrentThread );

  return LoadStatus;
}

As you see in the code, MmHotPatchRoutine will try to load the hotpatch image — we can verify this in the debugger:

kd> bp nt!MmLoadSystemImage

kd> g
Breakpoint 3 hit
nt!MmLoadSystemImage:
808ec4b5 6878010000      push    178h

kd> k
ChildEBP RetAddr  
f6acbb28 80990c9e nt!MmLoadSystemImage
f6acbb68 809b2d67 nt!MmHotPatchRoutine+0x59
f6acbba8 808caeff nt!ExApplyCodePatch+0x191
f6acbd50 8082337b nt!NtSetSystemInformation+0xa1e
f6acbd50 7c82ed54 nt!KiFastCallEntry+0xf8
0006bc50 7c821f24 ntdll!KiFastSystemCallRet
0006bd44 7c8304c9 ntdll!ZwSetSystemInformation+0xc
[...]

kd> dt _UNICODE_STRING poi(@esp+4)
ntdll!_UNICODE_STRING
 "\??\c:\windows\system32\drivers\hpf3.tmp"
   +0x000 Length           : 0x50
   +0x002 MaximumLength    : 0x50
   +0x004 Buffer           : 0x81623fa8  "\??\c:\windows\system32\drivers\hpf3.tmp"
   
kd> gu

kd> lm
start    end        module name
[...]           
f6ba4000 f6bad000   hpf3       (deferred)  
[...]
f95cb000 f9641000   mrxsmb     (deferred)  
f9641000 f9671000   rdbss      (deferred)      
[...]

Having loaded the hotpatch image, MmHotPatchRoutine proceeds by calling MiPerformHotPatch, which looks roughly like this:

NTSTATUS
MiPerformHotPatch(
  IN PLDR_DATA_TABLE_ENTRY ImageHandle,
  IN PVOID ImageBaseAddress,
  IN DWORD Flags
  )
{
  PHOTPATCH_HEADER SectionData ;
  PRTL_PATCH_HEADER Header;    
  NTSTATUS Status;
  PVOID LockVariable;
  PVOID LockedBuffer;
  BOOLEAN f;
  PLDR_DATA_TABLE_ENTRY LdrEntry;

  SectionData = RtlGetHotpatchHeader( ImageBaseAddress );
  if ( ! SectionData  )
  {
    return STATUS_INVALID_PARAMETER;
  }
  
  //
  // Try to get header from MiHotPatchList
  //
  Header = RtlFindRtlPatchHeader(
    MiHotPatchList,
    ImageHandle );

  if ( ! Header )
  {
    PLIST_ENTRY Entry;

    if ( Flags & FLG_HOTPATCH_ACTIVE )
    {
      return STATUS_NOT_SUPPORTED;
    }

    Status = RtlCreateHotPatch(
      &Header,
      SectionData,
      ImageHandle,
      Flags
      );
    if ( ! NT_SUCCESS( Status ) )
    {
      return Status;
    }

    ExAcquireResourceExclusiveLite(
      PsLoadedModuleResource,
      TRUE
      );

    for ( Entry = PsLoadedModuleList->Flink;
          Entry != PsLoadedModuleList;
          Entry = Entry->Flink )
    {
      LdrEntry = CONTAINING_RECORD( Entry,
                                    KLDR_DATA_TABLE_ENTRY,
                                    InLoadOrderLinks );
      if ( LdrEntry->DllBase >= MiSessionImageEnd )
      {
        if ( RtlpIsSameImage( Header, LdrEntry ) )
        {
          break;
        }
      }
    }

    ExReleaseResourceLite( PsLoadedModuleResource );

    if ( ! Header->TargetDllBase )
    {
      RtlFreeHotPatchData( Header );
      return STATUS_DLL_NOT_FOUND;
    }

    Status = ExLockUserBuffer(
      ImageHandle->DllBase,
      ImageHandle->SizeOfImage,
      KernelMode,
      IoWriteAccess,
      &LockedBuffer,
      &LockVariable
      );
    if ( ! NT_SUCCESS( Status ) )
    {
      RtlFreeHotPatchData( Header );
      return Status;
    }


    Status = RtlInitializeHotPatch(
      Header,
      ( PBYTE ) LockedBuffer - ( PBYTE ) ImageHandle->DllBase
      );

    ExUnlockUserBuffer( LockVariable );

    if ( ! NT_SUCCESS( Status ) )
    {
      RtlFreeHotPatchData( Header );
      return Status;
    }

    f = TRUE;
  }
  else
  {
    if ( ( Flags ^ ImageHandle->CodeInfo->Flags ) & FLG_HOTPATCH_ACTIVE )
    {
      return STATUS_NOT_SUPPORTED;
    }

    if ( ! ( ImageHandle->CodeInfo->Flags & FLG_HOTPATCH_ACTIVE ) )
    {
      Status = RtlReadHookInformation( Header );
      if ( ! NT_SUCCESS( Status ) )
      {
        return Status;
      }
    }

    f = FALSE;
  }
  
  Status = MmLockAndCopyMemory(
    ImageHandle->CodeInfo,
    KernelMode
    );
  if ( NT_SUCCESS( Status ) )
  {
    if ( ! f  )
    {
      return Status;
    }

    LdrEntry->EntryPointActivationContext = Header;  // ???
    InsertTailList( &MiHotPatchList, &LdrEntry->PatchList );
  }
  else
  {
    if ( f ) 
    {
      RtlFreeHotPatchData( Header );
    }
  }

  return Status;
}

So MiPerformHotPatch inspects the hotpatch information stored in the hotpatch image. This data describes which code regions need to be updated. Once the necessary information has been gathered, it applies the code changes.

Two basic problems have to be overcome now: On the one hand, the code sections of drivers are mapped read/execute only, so overwriting the instructions in place does not work. On the other hand, the system has to properly synchronize the patching process, i.e. it has to make sure that no CPU is currently executing the code that is about to be patched.

To overcome the memory protection problem, Windows uses a trick I previously only knew from malware: It creates a memory descriptor list (MDL) for the affected code region, maps the MDL, and updates the code through this mapped region, thereby circumventing the memory protection. As it turns out, there is even a handy, undocumented helper routine for this purpose: ExLockUserBuffer, which is used by MiPerformHotPatch.

To proceed, MiPerformHotPatch calls MmLockAndCopyMemory to do the actual patching. So how does Windows synchronize the update process? Again, it uses a technique I had assumed to be a malware trick: It schedules CPU-specific DPCs on all CPUs but the current one and keeps those DPCs busy while the current thread is updating the code. Again, Windows provides a neat routine for that: KeGenericCallDpc. In addition, Windows raises the IRQL to clock level in order to mask all interrupts.

Here is the pseudo-code for MmLockAndCopyMemory and its helper, MiDoCopyMemory:

NTSTATUS
MmLockAndCopyMemory (
    IN PSYSTEM_HOTPATCH_CODE_INFORMATION PatchInfo,
    IN KPROCESSOR_MODE ProbeMode
    )
{
  PVOID *Buffer;
  NTSTATUS Status;
  UINT Index;

  if ( 0 == PatchInfo->CodeInfo.DescriptorsCount )
  {
    return STATUS_SUCCESS;
  }

  Buffer = ExAllocatePoolWithQuotaTag( 
    9,
    PatchInfo->CodeInfo.DescriptorsCount * sizeof( PVOID ),
    'PtoH' );
  if ( ! Buffer )
  {
    return STATUS_INSUFFICIENT_RESOURCES;
  }
  RtlZeroMemory( Buffer, PatchInfo->CodeInfo.DescriptorsCount * sizeof( PVOID ) );

  if ( 0 == PatchInfo->CodeInfo.DescriptorsCount )
  {
    Status = STATUS_INVALID_PARAMETER;
    goto Cleanup;
  }

  for ( Index = 0; Index < PatchInfo->CodeInfo.DescriptorsCount; Index++ )
  {
    if ( PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeOffset > PatchInfo->InfoSize ||
       PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeSize > PatchInfo->InfoSize ||
       PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeOffset +
       PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeSize > PatchInfo->InfoSize || 
       /* other checks... */ )
    {
      Status = STATUS_INVALID_PARAMETER;
      goto Cleanup;
    }

    Status = ExLockUserBuffer(
      PatchInfo->CodeInfo.CodeDescriptors[ Index ].TargetAddress,
      PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeSize,
      ProbeMode,
      IoWriteAccess,
      &PatchInfo->CodeInfo.CodeDescriptors[ Index ].MappedAddress,
      &Buffer[ Index ]
      );
    if ( ! NT_SUCCESS( Status ) )
    {
      goto Cleanup;
    }
  }

  PatchInfo->Flags |= FLG_HOTPATCH_ACTIVE;

  KeGenericCallDpc(
    MiDoCopyMemory,
    PatchInfo );

  if ( PatchInfo->Flags & FLG_HOTPATCH_VERIFICATION_ERROR )
  {
    PatchInfo->Flags &= ~FLG_HOTPATCH_ACTIVE;
    PatchInfo->Flags &= ~FLG_HOTPATCH_VERIFICATION_ERROR;
    Status = STATUS_DATA_ERROR;
  }

Cleanup:
  if ( PatchInfo->CodeInfo.DescriptorsCount > 0 )
  {
    for ( Index = 0; Index < PatchInfo->CodeInfo.DescriptorsCount; Index++ )
    {
      if ( Buffer[ Index ] )
      {
        ExUnlockUserBuffer( Buffer[ Index ] );
      }
    }
  }

  ExFreePoolWithTag( Buffer, 0 );
  return Status;
}

VOID
MiDoCopyMemory(
  IN PKDPC Dpc,
  IN PSYSTEM_HOTPATCH_CODE_INFORMATION PatchInfo,
  IN ULONG NumberCpus,
  IN PDEFERRED_REVERSE_BARRIER ReverseBarrier
  )
{
  KIRQL OldIrql;
  NTSTATUS Status;
  ULONG Index;

  UNREFERENCED_PARAMETER( Dpc );

  OldIrql = KfRaiseIrql( CLOCK1_LEVEL );

  //
  // Decrement reverse barrier count.
  //
  Status = KeSignalCallDpcSynchronize( ReverseBarrier );
  if ( ! NT_SUCCESS( Status ) )
  {
    goto Cleanup;
  }

  PatchInfo->Flags &= ~FLG_HOTPATCH_VERIFICATION_ERROR;
    
  for ( Index = 0; Index < PatchInfo->CodeInfo.DescriptorsCount; Index++ )
  {
    if ( PatchInfo->Flags & FLG_HOTPATCH_ACTIVE )
    {
      if ( PatchInfo->CodeInfo.CodeDescriptors[ Index ].ValidationSize != 
        RtlCompareMemory(
          PatchInfo->CodeInfo.CodeDescriptors[ Index ].MappedAddress,
          ( PBYTE ) PatchInfo + PatchInfo->CodeInfo.CodeDescriptors[ Index ].ValidationOffset,
          PatchInfo->CodeInfo.CodeDescriptors[ Index ].ValidationSize ) )
      {

        if ( PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeSize != 
          RtlCompareMemory(
            PatchInfo->CodeInfo.CodeDescriptors[ Index ].MappedAddress,
            ( PBYTE ) PatchInfo + PatchInfo->CodeInfo.CodeDescriptors[ Index ].OrigCodeOffset,
            PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeSize ) )
        {
          PatchInfo->Flags |= FLG_HOTPATCH_VERIFICATION_ERROR;
          break;
        }
      }
    }
    else
    {
      if ( PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeSize !=
        RtlCompareMemory(
          PatchInfo->CodeInfo.CodeDescriptors[ Index ].MappedAddress,
          ( PBYTE ) PatchInfo + PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeOffset,
          PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeSize ) )
      {
        PatchInfo->Flags |= FLG_HOTPATCH_VERIFICATION_ERROR;
        break;
      }
    }
  }

  if ( PatchInfo->Flags & FLG_HOTPATCH_VERIFICATION_ERROR ||
     PatchInfo->CodeInfo.DescriptorsCount <= 0 )
  {
    goto Cleanup;
  }

  for ( Index = 0; Index < PatchInfo->CodeInfo.DescriptorsCount; Index++ )
  {
    PVOID Source;
    if ( PatchInfo->Flags & FLG_HOTPATCH_ACTIVE )
    {
      Source = ( PBYTE ) PatchInfo + PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeOffset;
    }
    else
    {
      Source = ( PBYTE ) PatchInfo + PatchInfo->CodeInfo.CodeDescriptors[ Index ].OrigCodeOffset;
    }

    RtlCopyMemory(
      PatchInfo->CodeInfo.CodeDescriptors[ Index ].MappedAddress,
      Source,
      PatchInfo->CodeInfo.CodeDescriptors[ Index ].CodeSize
      );
  }


Cleanup:
   KeSignalCallDpcSynchronize( ReverseBarrier );
   KfLowerIrql( OldIrql );
   KeSignalCallDpcDone( NumberCpus );
}

To see the code in action, we set a breakpoint on nt!MiDoCopyMemory:

kd> k
ChildEBP RetAddr  
f6acbac0 8087622f nt!MiDoCopyMemory
f6acbae8 80990a10 nt!KeGenericCallDpc+0x3d
f6acbb0c 80990bea nt!MmLockAndCopyMemory+0xf1
f6acbb34 80990cba nt!MiPerformHotPatch+0x143
f6acbb68 809b2d67 nt!MmHotPatchRoutine+0x75
f6acbba8 808caeff nt!ExApplyCodePatch+0x191
f6acbd50 8082337b nt!NtSetSystemInformation+0xa1e

Before letting MiDoCopyMemory do its work, let’s see what it is about to do. No modifications have yet been done to mrxsmb:

kd> !chkimg mrxsmb
0 errors : mrxsmb 

kd> !chkimg rdbss
0 errors : rdbss

The second argument is a structure holding the information gathered previously; peeking into it reveals:

kd> dd /c 1 poi(esp+8) l 4
81583008  00000001
8158300c  00000149
81583010  00000008   <-- # of code patches
81583014  f9648b1f   <-- hmm...

As it turns out, address 81583014 refers to an array of eight entries. Poking around with dd, the following listing suggests that each entry is 28 bytes in size:

kd> dd /c 7 81583014
81583014  f9648b1f fa2afb1f 000000ec 00000005 000000f1 000000f6 00000005
81583030  f9648b24 fa2b2b24 000000fb 00000002 000000fd 000000ff 00000002
8158304c  f96585ef fa2b15ef 00000101 00000005 00000106 0000010b 00000005
81583068  f96585f4 fa2b45f4 00000110 00000002 00000112 00000114 00000002
81583084  f9658569 fa2b3569 00000116 00000005 0000011b 00000120 00000005
815830a0  f965856e fa2b656e 00000125 00000002 00000127 00000129 00000002
815830bc  f9653378 fa2b5378 0000012b 00000005 00000130 00000135 00000005
815830d8  f965337d fa2b837d 0000013a 00000005 0000013f 00000144 00000005

Given that rdbss was loaded to the address range f9641000-f9671000, it is obvious that the first two columns refer to code addresses. The third, fifth and sixth columns look like offsets, the fourth and seventh like lengths of the code changes. First, let’s see where the first column points to:

kd> u f9648b1f
rdbss!RxInitiateOrContinueThrottling+0x6b:
f9648b1f 90              nop
f9648b20 90              nop
f9648b21 90              nop
f9648b22 90              nop
f9648b23 90              nop
rdbss!RxpCancelRoutine:
f9648b24 8bff            mov     edi,edi
f9648b26 55              push    ebp
f9648b27 8bec            mov     ebp,esp

Now that looks promising, especially since the fourth column holds the value 5. Let’s look at the second row:

kd> u f9648b24
rdbss!RxpCancelRoutine:
f9648b24 8bff            mov     edi,edi

No doubt, the first and second rows define the two patches necessary to redirect RxpCancelRoutine. But what should this code be replaced with? As it turns out, the offsets in column three are relative to the structure and point to the code that is to be written:

kd> u poi(esp+8)+000000ec
815830f4 e9dcc455fd      jmp     7eadf5d5
815830f9 8bff            mov     edi,edi

kd> u poi(esp+8)+000000fb
81583103 ebf9            jmp     815830fe

That makes perfect sense: the five nops are to be overwritten by a near jump, while the mov edi, edi will be replaced by a short jump.

So let’s run MiDoCopyMemory and have a look at the results. Back in MmLockAndCopyMemory, the code referred to by the first two rows looks like this:

kd> u f9648b1f
rdbss!RxInitiateOrContinueThrottling+0x6b:
f9648b1f e9dcc455fd      jmp     hpf3!RxpCancelRoutine (f6ba5000)

kd> u f9648b24
rdbss!RxpCancelRoutine:
f9648b24 ebf9            jmp     rdbss!RxInitiateOrContinueThrottling+0x6b (f9648b1f)
f9648b26 55              push    ebp
f9648b27 8bec            mov     ebp,esp

Voilà, RxpCancelRoutine has been patched and calls are redirected to hpf3!RxpCancelRoutine, the new routine located in the auxiliary ‘hpf3’ driver. All that remains to be done is cleanup (unlocking the memory, etc.).

That’s it — that’s how Windows applies patches on the fly using hotpatching. Too bad that the technology is so rarely used in practice.

#ifdef _WIN32

When writing processor-specific code, the _M_IX86, _M_AMD64 and _M_IA64 macros can be used for conditional compilation — so far, so good. But sometimes code is not exactly processor-specific but rather specific to the natural machine word length (i.e. 32 bit or 64 bit). For such situations, there are defines, too — however, there is a little catch: For ancient 16 bit code, there is _WIN16. For 64 bit, the WDK build environment defines _WIN64 by default. Given these two macros, it is tempting to conclude that _WIN32 should only be defined for 32 bit builds — however, this is not the case. As it turns out, _WIN32 is always defined, both for 32 bit and 64 bit builds.

And yes, this behaviour is documented on MSDN, but it is stupid anyway.

However, where _WIN32 can be of use is when writing code targeting multiple platforms — as _WIN32 is always defined, it can be used as an indicator that you compile for Windows, regardless of the compiler used (another option is to use _MSC_VER, but that is compiler-specific).

Windows Hotpatching

Several years ago, with Windows Server 2003 SP1, Microsoft introduced a technology and infrastructure called Hotpatching. The basic intent of this infrastructure is to provide a means of applying hotfixes on the fly, i.e. without having to reboot the system — even if the hotfix contains changes to critical system components such as the kernel itself, important drivers, or user mode libraries such as shell32.dll.

Trying to apply hotfixes on the fly introduces a variety of problems, the most important being:

  • Patching code that is currently in use
  • Atomically replacing files on disk that are currently in use and therefore locked
  • Making sure that all changes take effect for both processes currently running and processes yet to be started (i.e. before the next reboot)
  • Allowing further hotfixes to be applied on the fly on a system that has not been rebooted since the last hotfix was applied

The Windows Hotpatching infrastructure is capable of handling all these cases. It is, however, not applicable to all kinds of code fixes: Generally speaking, it can only be used for fixes that comprise smallish code changes and do not affect the layout or semantics of data structures. A fix for a buffer overflow caused by an off-by-one error, for example, could certainly be applied using the Hotpatching infrastructure.

That all sounds good and nice, but reality is that we still reboot our machines for just about every update Microsoft provides us, right?

Right. The answer to this is threefold. First, as indicated, some hotfixes can be expected to make changes that cannot safely be applied using the Hotpatching system. Secondly, Hotpatching is used on an opt-in basis, so you will not benefit from it automatically: When a hotpatch-enabled hotfix is applied through Windows Update or by launching the corresponding exe file, hotpatching is not used and a reboot will be required. The user has to explicitly specify the /hotpatch:enable switch in order to have the hotfix applied on the fly.

In the months after the release of SP1, a certain fraction of the hotfixes issued by Microsoft were indeed hotpatch-enabled and could be applied without a reboot. Interestingly, however, I am not aware of a single hotfix issued since Server 2003 SP2 that supported hotpatching!

And thirdly: Whether Microsoft has lost faith in their hotpatching facility, whether the effort to test such hotfixes turned out to be too high, or whether there were other reasons speaking against issuing hotpatch-enabled hotfixes — I do not know.

Notwithstanding this observation, Hotpatching is an interesting technology that deserves to be looked at in more detail. Although I will not cover the entire infrastructure, I will spend at least one more blog post on the mechanisms implemented in Windows that allow code modifications to be performed on the fly. That is, I will focus on the hotpatching part of the infrastructure and will ignore coldpatching and other, smaller aspects of the infrastructure.


About me

Johannes Passing, M.Sc., living in Berlin, Germany.

Besides his consulting work, Johannes mainly focusses on Win32, COM, and NT kernel mode development, along with Java and .Net. He also is the author of cfix, a C/C++ unit testing framework for Win32 and NT kernel mode, Visual Assert, a Visual Studio Unit Testing-AddIn, and NTrace, a dynamic function boundary tracing toolkit for Windows NT/x86 kernel/user mode code.

Contact Johannes: jpassing (at) acm org

Johannes' GPG fingerprint is BBB1 1769 B82D CD07 D90A 57E8 9FE1 D441 F7A0 1BB1.
