Archive for February, 2015

Runtime Code Modification Explained, Part 4: Keeping Execution Flow Intact


Concurrent Execution

A typical user mode process on a Windows system can be expected to have more than one thread. In addition to user threads, the Windows kernel employs a number of system threads. Given the presence of multiple threads, it is likely that whenever a code modification is performed, more than one thread is affected, i.e. more than one thread is sooner or later going to execute the modified code sequence.

The basic requirement that has to be met is that even in the presence of a preemptive, multi threaded, multiprocessing environment, an instrumentation solution has to ensure that any other thread either does not run the affected code at all, runs code not yet reflecting the respective modifications or runs code reflecting the entire set of respective modifications.

On a multiprocessor system, threads are subject to concurrent execution. While one thread is currently performing code modifications, another thread, running on a different processor, may concurrently execute the affected code.

If only a single instruction is to be modified and the cited algorithm for cross-modifying code is used, concurrent execution, preemption and interruption should not be of concern. Any other thread will either execute the old or new instruction, but never a mixture of both.

However, the situation is different when more than one instruction is to be modified. In this case, a different thread may execute partially modified code.

Although code analysis may indicate certain threads not to ever call the routine comprising the affected code, signals or Asynchronous Procedure Calls (APCs) executed on this thread may. Therefore, a separation in affected and non-affected threads may not always be possible and it is safe to assume that all threads are potentially affected.

Preemption and Interruption

Both on a multiprocessor and a uniprocessor system, all threads running in user mode as well as threads running in kernel mode at IRQL APC_LEVEL or below are subject to preemption. Similarly, for a thread running at DISPATCH_LEVEL or Device IRQL (DIRQL), it is also possible to be interrupted by a device interrupt. As these situations are similar, only the case of preemption is discussed.

If only a single instruction is to be modified, preemption and interruption may not be problematic. If, however, multiple instructions are to be adapted, the ramifications of preemption in this context are twofold. On the one hand, the code performing the modification may be preempted while being in the midst of a multi-step runtime code modification operation:

  • Thread A performs a runtime code modification. Before the last instruction has been fully modified, the thread is preempted. The instruction stream is now in a partially-modified state.
  • Thread B begins executing the code that has been modified by Thread A. In case instruction boundaries of old and new code match, the instruction sequence that is now run by Thread B should consist of valid instructions only, yet the mixture of old and new code may define unintended behavior. If instruction lengths do not match, the situation is worse. After Thread B has executed the last fully-modified instruction, the CPU will encounter a partially-overwritten instruction. Not being aware of this shift of instruction boundaries, the CPU will interpret the following bytes as instruction stream, which may or may not consist of valid instructions. As the code now executed has never been intended to be executed, the behavior of Thread B may now be considered arbitrary.

In order to avoid such a situation from occurring, an implementation can disable preemption and interruption by raising the IRQL above DIRQL during the modification process.

On the other hand, the code performing the code modification may run uninterrupted, yet one of the preempted threads might become affected:

  • Thread A has begun executing code being part of the instruction sequence that is about to be modified. Before having completed executing the last instruction of this sequence, it is preempted.
  • Thread B is scheduled for execution and performs the runtime code modification. Not before all instructions have been fully modified, it is preempted.
  • Thread A is resumed. Two situations may now occur — either the memory pointed to by the program counter still defines the start of a new instruction or — due to instruction boundaries having moved — it points into the middle of an instruction. In the first case, a mixture of old and new code is executed. In the latter case, the memory is reinterpreted as instruction stream. In both cases, the thread is likely to exhibit unintended behavior.

One approach of handling such situations is to prevent them from occurring by adapting the scheduling subsystem of the kernel. However, supporting kernel preemption is a key characteristic of the Windows NT scheduler — removing the ability to preempt kernel threads thus hardly seems like an auspicious approach. Regarding the Linux kernel, however, it is worth noting that kernel preemption is in fact an optional feature supported on more recent versions (2.6.x) only. As a consequence, for older versions or kernels not using this option, the situation as described in the previous paragraph cannot occur.

A more lightweight approach to this problem relies on analysis of concurrently running as well as preempted threads. That is, the program counters of all threads are inspected for pointing to regions that are about to be affected by the code modification. If this is the case, the code modification is deemed unsafe and is aborted. Needless to say, it is crucial that all threads are either preempted or paused during this analysis as well as during the code modification itself. As the thread performing the checks and modifications is excluded from being paused and analyzed, it has to be assured that this thread itself is not in danger of interfering with the code modification.

In a similar manner, the return addresses of the stack frames of each stack can be inspected for pointing to such code regions. Stack walking, however, is exposed to a separate set of issues that I’ll disuss separately.

Rather than aborting the operation in case one of the threads is found to be negatively affected by the pending code modification, a related approach is to attempt to fix the situation. That is, the program counters of the affected threads are updated so that they can resume properly.

One example for a user-mode solution implementing this approach is Detours. Before conducting any code modification, Detours suspends all threads a user has specified as being potentially affected by this operation. After having completed all code modifications, all suspended threads are inspected and their program counters are adapted if necessary. Not before this step has completed, the threads are resumed.

Basic Block Boundaries

Another issue of multiple instruction modification is related to program flow. Whenever a sequence of instructions that is to be altered spans multiple basic blocks, it is possible that not only the first instruction of the sequence, but also one of the subsequent instructions may be a branch target. When instruction boundaries are not preserved by the code modification step, the branch target might fall into in the midst of one of the new instructions. Again, such a situation is likely to lead to unintended program behavior.

Identifying basic blocks and thus any potential branch targets requires flow analysis. However, especially in the case of optimized builds, it is insufficient to perform an analysis of the affected routine only as blocks might be shared among more than one routine. In such cases, a routine does not consist of a contiguous region of code but may be scattered throughout the image. Therefore, it is crucial to perform flow analysis on the entire image. But even in this case, the existence of indirect branches may render a complete analysis impossible in practice.

Another situation where an instrumentation solution may run into the danger of overwriting basic block boundaries is the instrumentation of very short routines. If the routine is shorter (in terms of instruction bytes occupied) than the
instructions that need to be injected in order to instrument the routine, the first basic block(s) of the subsequent routine may be overwritten.


Runtime Code Modification Explained, Part 3: Cross-Modifying Code and Atomicity

Performing modifications on existing code is a technique commonly encountered among instrumentation solutions such as DTrace. Assuming a multiprocessor machine, altering code brings up the challenge of properly synchronizing such activity among processors.

As stated before, IA-32/Intel64 allows code to be modified in the same manner as data. Whether modifying data is an atomic operation or not, depends on the size of the operand. If the total number of bytes to be modified is less than 8 and the target address adheres to certain alignment requirements, current IA-32 processors guarantee atomicity of the write operation.

If any of these requirements do not hold, multiple write instructions have to be performed, which is an inherently non-atomic process. What is often ignored, however, is that even in situations where using atomic writes or bus locking (i.e. using the lock prefix) on IA-32 or AMD64 would be feasible, such practice would not necessarily be safe as instruction fetches are allowed to pass locked instructions. Quoting of the Intel manual:

Locked operations are atomic w.r.t. all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.

Although appealing, merely relying on the atomicity of store operations must therefore in many cases be assumed to be insufficient for ensuring safe operation.

The exact behavior in case of runtime code modifications also slightly varies among different CPU models. On the one hand, guarantees concerning safety of such practices have, as indicated before, been lessened over the evolvement from the Intel 486 series to the current Core 2 series. On the other hand, certain steppings of CPU models even exhibit defective behavior in this regard, as explained in several Intel errata, including this one for the Pentium III Xeon.

Due to this variance, the exact range of issues that can arise when performing code modifications is not clear and appropriate countermeasures cannot be easily identified. As described in these errata, cross-modifying code not adhering to certain coding practices described later, can lead to “unexpected execution behavior”, which may include the generation of exceptions.

The route chosen by the Intel documentation is thus to specify an algorithm that is guaranteed to work across all processor models — although for some processors, it might be more restricting than necessary.

For cross-modifying code, the suggested algorithm makes use of serializing instructions. The role of these instructions, cpuid being one of them, is to force any modifications to registers, memory and flags to be completed and to drain all buffered writes to memory before the next instruction is fetched and executed.

Quoting the algorithm defined in the Intel manual:

(* Action of Modifying Processor *)
Memory_Flag <- 0; (* Set Memory_Flag to value other than 1 *)
Store modified code (as data) into code segment;
Memory_Flag <- 1;

(* Action of Executing Processor *)
WHILE (Memory_Flag != 1)
Wait for code to update;

Execute serializing instruction;
Begin executing modified code;

To further complicate matters, the IA-32 architecture, as you know, uses a variable-length instruction set. As a consequence of that, additional problems not yet addressed may occur if the instruction lengths of unmodified and new instruction do not match. Two situations may occur

  1. The new instruction is longer than the old instruction. In this case, more than one instruction has to be modified. Modifications straddling instruction boundaries, however, are exposed to an extended set of issues that will be covered in my next post.
  2. The new instruction is shorter than the old instruction. The ramifications of this situation depend on the nature of the new instruction. If, for instance, the instruction is an unconditional branch instruction, the subsequent pad bytes will never be executed and can be neglected.If, on the other hand, execution may be resumed at the instruction following the new instruction, the pad bytes must constitute valid instructions. For this purpose, a sled consisting of nop instructions can be used to fill the pad bytes.The algorithm defined by Intel for cross-modifying code ensures that neither the old nor the new instruction is currently being executed while the modification is still in progress. Therefore, when employing this algorithm, replacing a single instruction by more than one instruction can be considered to be equally safe to replacing an instruction by an equally-sized instruction.

It is worthwhile to notice that regardless which situation applies for instrumentation, the complementary situation will apply to uninstrumentation.


About me

Johannes Passing lives in Berlin, Germany and works as a Solutions Architect at Google Cloud.

While mostly focusing on Cloud-related stuff these days, Johannes still enjoys the occasional dose of Win32, COM, and NT kernel mode development.

He also is the author of cfix, a C/C++ unit testing framework for Win32 and NT kernel mode, Visual Assert, a Visual Studio Unit Testing-AddIn, and NTrace, a dynamic function boundary tracing toolkit for Windows NT/x86 kernel/user mode code.

Contact Johannes: jpassing (at) hotmail com

LinkedIn Profile
Xing Profile
Github Profile