Archive for January, 2015

Runtime Code Modification Explained, Part 2: Cache Coherency Issues

Instrumenting a routine may comprise multiple steps. As an example, a trampoline may need to be generated or updated, followed by a modification of the original routine, which may include updating or replacing a branch instruction to point to the trampoline.

In such cases, it is essential for maintaining consistency that the code changes take effect in a specific order. If the branch were written before the trampoline code had been stored, the branch would temporarily point to uninitialized memory. If multiple CPUs were involved and the code were executed while in such an inconsistent state, undefined execution behaviour would occur.

The order in which a program specifies memory loads and stores to be conducted is referred to as program order. On processors such as the Intel 386, this order is preserved. Contemporary processors, however, implement significantly weaker memory models. To speed up execution, these processors allow certain memory operations to be conducted out of order. Such reordering may, in situations like the one described above, lead to wrong results or to windows of inconsistency. To prevent such situations from occurring, the program must explicitly prohibit certain reorderings, which can be done by using memory fences.
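The trampoline scenario can be sketched in portable C11, where a release fence prohibits the store publishing the branch target from being reordered before the stores writing the trampoline. The buffer and function names below are illustrative, not taken from any real instrumentation library:

```c
#include <stdatomic.h>
#include <string.h>

/* Hypothetical trampoline buffer and published branch target. */
static unsigned char trampoline[16];
static _Atomic(unsigned char *) branch_target = NULL;

/* Publish a trampoline so that no observer can see the branch pointing
 * at uninitialized memory: the release fence orders the trampoline
 * stores before the store of the pointer. */
void publish_trampoline(const unsigned char *code, size_t len)
{
    memcpy(trampoline, code, len);             /* step 1: write trampoline   */
    atomic_thread_fence(memory_order_release); /* no store may sink below    */
    atomic_store_explicit(&branch_target, trampoline,
                          memory_order_relaxed); /* step 2: redirect branch  */
}
```

On a strongly ordered architecture the fence may compile to nothing; on a weakly ordered one, the compiler emits the required barrier instruction.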

Respecting the memory model implemented by the processor is thus crucial for safe operation. Although both load and store operations are subject to potential reordering, only the reordering of stores is of interest in the context of the example above. The memory models and memory ordering guarantees of the various CPUs discussed in this post, however, differ significantly.

IA-32 and Intel 64 implement a rather strong memory model. In particular, memory stores are always carried out in program order — this holds true for both uniprocessor and multiprocessor systems. For the situation depicted above, this means that the store of the updated branch target is not conducted before all stores writing the trampoline have completed.

SPARC V9 offers a choice between three memory models, which differ in the guarantees they provide: Total Store Order (TSO), Partial Store Order (PSO), and Relaxed Memory Order (RMO). TSO, the strongest of the three, guarantees preservation of the order of store operations; as such, no memory fences are required. PSO and RMO, the weakest of the three, do not provide such guarantees. That is, to ensure that the second store is not carried out before the first store has completed, an appropriate memory fence instruction, i.e. a MEMBAR #StoreStore instruction, has to be executed between the two stores.

The memory model implemented by IA-64 also allows stores to be conducted out of order. This can be prevented either by a memory fence or by giving the second store release semantics: by using the st.rel instruction rather than st, the program indicates to the processor that this store must not take effect until all prior orderable instructions, which include the first store, have taken effect.

Instruction Cache/Data Store Incoherencies

As mentioned before, many modern microprocessors, including contemporary IA-32, SPARC and IA-64 CPUs, use dedicated instruction caches. Whether these instruction caches are kept coherent with the data caches depends on the architecture. Again, IA-32 and Intel 64 are more forgiving than other CPUs in this regard and keep instruction and data caches coherent — no manual flushing of the instruction cache is required. (As a consequence, NtFlushInstructionCache is essentially a no-op on these architectures.)

SPARC does not maintain this coherency automatically. To have instruction changes take effect immediately, SPARC requires the developer to issue a FLUSH instruction for each modified machine word of instructions. In a similar manner, an fc.i instruction is required on IA-64 to flush the respective instructions from the instruction cache.

Pipeline/Instruction Cache Incoherencies

Another issue that may occur when writing self- or cross-modifying code is that the processor’s pipeline becomes incoherent with the instruction cache. That is, although the instruction cache contains the updated instructions, the CPU may continue working with outdated instructions for a while.

According to the SPARC processor manual, SPARC is not exposed to this problem: flushing the instruction cache is sufficient. On IA-64, however, this incoherency can occur — whether it is in fact a problem depends on the individual usage scenario. To force the instruction flush to take effect immediately and to synchronize the instruction cache with the instruction fetch stream, issuing a sync.i instruction is required.

On IA-32, the exact behavior in the case of runtime code modification in general and such incoherencies in particular varies slightly among different models. On the one hand, guarantees concerning the safety of such practices have been weakened over the evolution from the Intel 486 series to the current Core 2 series. On the other hand, certain steppings of CPU models have been explicitly documented to exhibit defective behavior in this regard. Although detailed technical information on this topic is not available, these problems seem to emerge only when code that is currently being executed is modified.

Due to this variance, the exact range of issues that can arise due to runtime code modification and cache incoherencies is not clear, and appropriate countermeasures cannot be easily identified. The route chosen by the Intel documentation is thus to specify an algorithm that is guaranteed to work across all processor models — although for some processors, it might be more restrictive than necessary.

For cross-modifying code, the suggested algorithm makes use of serializing instructions. The role of these instructions, cpuid being among them, is to force any modifications to registers, memory and flags to be completed and to drain all buffered writes to memory before the next instruction is fetched and executed.
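As an illustration, a thread can execute cpuid through the GCC/Clang `<cpuid.h>` helper to obtain such a serialization point before running freshly patched code. The function name is made up, and an x86 target is assumed:

```c
#include <cpuid.h>

/* cpuid is a serializing instruction: all prior modifications to
 * registers, memory and flags are completed and buffered writes are
 * drained before the next instruction is fetched. Returns 1 if the
 * CPUID leaf was available and the instruction executed. */
static int serialize_execution(void)
{
    unsigned int eax, ebx, ecx, edx;
    return __get_cpuid(0, &eax, &ebx, &ecx, &edx); /* executes CPUID, leaf 0 */
}
```

A thread that is about to jump into cross-modified code would call this after being signalled that the patch is complete.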

It is worth pointing out that Intel 64 does not seem to be exposed to this issue. Moreover, as the AMD documents state:

Synchronization for crossmodifying code is not required for code that resides within the naturally aligned quadword.

Local/Remote Incoherencies

Finally, there may be discrepancies between the code the local processor sees and the code other processors on an SMP system see. That is, although the stores may have already taken effect on the local CPU, they may be delayed on other CPUs, so that these CPUs may continue working with the old instructions for some amount of time.

In many cases, as long as ordering is preserved, such a delay is not a major problem. However, when the changes must take effect immediately, further steps are required.

Intel 64 specifies that

Stores from a processor appear to be committed to the memory system in program order; however, stores can be delayed arbitrarily by store buffering while the processor continues operation.

As a consequence, an MFENCE instruction should be executed as soon as the code patch has been written. In a similar way, IA-64 requires an mf (memory fence) instruction to be issued after the fc.i and sync.i if changes are to take effect immediately on remote CPUs.
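A minimal sketch, assuming a C11 compiler targeting x86-64 — where a sequentially consistent fence compiles to MFENCE — with an illustrative patch site:

```c
#include <stdatomic.h>
#include <string.h>

static unsigned char patch_site[5];

/* After writing a code patch, a full fence (MFENCE on Intel 64, mf on
 * IA-64) forces the stores out of the store buffer so that remote CPUs
 * observe the new code without arbitrary delay. */
void apply_patch(const unsigned char *code, size_t len)
{
    memcpy(patch_site, code, len);             /* write the patch        */
    atomic_thread_fence(memory_order_seq_cst); /* drain the store buffer */
}
```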


Runtime Code Modification Explained, Part 1: Dealing With Memory

Runtime code modification, or self-modifying code as it is often referred to, has been used for decades — to implement JITters, to write highly optimized algorithms, or to do all kinds of interesting stuff. Using runtime code modification has never been really easy — it requires a solid understanding of machine code, and it is straightforward to screw up. What’s not so well known, however, is that writing such code has actually become harder over the last years, at least on the IA-32 platform: Comparing the 486 and current Core architectures, it becomes obvious that Intel, in order to allow more advanced CPU-internal optimizations, has actually lessened certain guarantees made by the CPU, which in turn requires the programmer to pay more attention to certain details.

Looking around on the web, there are plenty of code snippets and example projects that make use of self-modifying code. Without finger-pointing at specific resources, it is safe to assume that a significant (and I mean significant!) fraction of these examples fail to address all potential problems related to runtime code modification. As I have shown a while ago, even Detours, a well-done and widely recognized library relying on runtime code modification, has its issues.

Adopting the nomenclature suggested by the Intel processor manuals, code that writes data to memory with the intent of having the same processor execute this data as code is referred to as self-modifying code. On SMP machines, it is possible for one processor to write data to memory with the intent of having a different processor execute this data as code. This process is referred to as cross-modifying code. I will jointly refer to both practices as runtime code modification.

Memory Model

The easiest part of runtime code modification is dealing with the memory model. In order to implement self-modifying or cross-modifying code, a program must be able to address the regions of memory containing the code to be modified. Moreover, due to memory protection mechanisms, overwriting code may not be trivially possible.

The IA-32 architecture offers three memory models — the flat, segmented and real mode memory models. Current operating systems like Windows and Linux rely on the flat memory model, so I will ignore the other two.

Whenever the CPU fetches code, it addresses memory relative to the segment mapped by the CS segment register. In the flat memory model, the CS segment register, which refers to the current code segment, is always set up to map to linear address 0. In the same manner, the data and stack segment registers (DS, SS) are set up to refer to linear address 0.

It is worth mentioning that AMD64 has retired the use of segmentation; the segment bases for the code and data segments are therefore always treated as 0.

Given this setup, code can be accessed and modified on IA-32 as well as on AMD64 in the same manner as data. Easy-peasy.
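Because code and data share one linear address space under the flat model, a function’s machine code can be read through an ordinary data pointer. A small, hypothetical helper to illustrate:

```c
#include <stddef.h>

/* Copy the first len bytes of a routine's machine code into out.
 * Works because, in the flat memory model, CS and DS both map linear
 * address 0, so code is addressable like any other data. */
static size_t read_code(const void *routine, unsigned char *out, size_t len)
{
    const unsigned char *p = (const unsigned char *)routine;
    for (size_t i = 0; i < len; i++)
        out[i] = p[i];
    return len;
}
```

Writing through such a pointer, of course, additionally requires the page protection discussed in the next section to permit it.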

Memory Protection

One of the features enabled by the use of paging is the ability to enforce memory protection. Each page can specify restrictions on which operations may be performed on memory of the respective page.

In the context of runtime code modification, memory protection is of special importance as memory containing code usually does not permit write access, but rather read and execute access only. A prospective solution thus has to provide a means to either circumvent such write protection or to temporarily grant write access to the required memory areas.

As other parts of the image are write-protected as well, memory protection equally applies to approaches that modify non-code parts of the image, such as the Import Address Table — that’s why the call to VirtualProtect is necessary when patching the IAT.

Programs using runtime code modification often do not restrict themselves to changing existing code but rather generate additional code. Assuming Data Execution Prevention has been enabled, it is thus vital for such approaches that any generated code is placed into memory regions that grant execute access. While user mode implementations can rely on a feature of the RTL heap (i.e. passing HEAP_CREATE_ENABLE_EXECUTE when calling RtlCreateHeap) for allocating executable memory, no comparable facility exists for kernel mode — a potential instrumentation solution thus has to come up with a custom allocation strategy.

Jump distances

Whenever code is being generated, odds are that there are branch instructions involved. Depending on where memory for the new code has been allocated and where the branch target falls, the offset between the branch instruction itself and the jump target may be of significant size. In such cases, the software has to make sure that the branch instruction chosen does in fact support offsets at least as large as required for the individual purpose. This sounds trivial, but it is not: Software that overwrites existing code with a branch may face severe limitations w.r.t. how many bytes the branch instruction may occupy — if, for example, there are fewer than 5 bytes of space (assuming IA-32), a near jump with a 32-bit displacement cannot be used. To get by with a 2-byte short jump instead, the newly allocated code better be nearby.
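The range check can be sketched as follows, assuming the 5-byte E9 near jump with a signed 32-bit displacement relative to the end of the instruction; the helper name is made up:

```c
#include <stdint.h>
#include <string.h>

/* Encode "jmp rel32" (opcode E9) at address `from` targeting `to`.
 * Returns 1 on success; 0 if the displacement does not fit into a
 * signed 32-bit offset and a different branching strategy is needed. */
int encode_near_jmp(unsigned char buf[5], uint64_t from, uint64_t to)
{
    int64_t disp = (int64_t)to - (int64_t)(from + 5); /* rel. to next insn */
    if (disp < INT32_MIN || disp > INT32_MAX)
        return 0;                     /* target too far for a near jump */
    buf[0] = 0xE9;
    int32_t d32 = (int32_t)disp;
    memcpy(buf + 1, &d32, 4);         /* little-endian displacement */
    return 1;
}
```

An instrumentation engine would fall back to, e.g., an indirect jump through a register or memory operand when this helper reports the target as out of range.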

Further safety concerns will be discussed in Part 2 of this series of posts.


About me

Johannes Passing lives in Berlin, Germany and works as a Solutions Architect at Google Cloud.

While mostly focusing on Cloud-related stuff these days, Johannes still enjoys the occasional dose of Win32, COM, and NT kernel mode development.

He is also the author of cfix, a C/C++ unit testing framework for Win32 and NT kernel mode; Visual Assert, a unit testing add-in for Visual Studio; and NTrace, a dynamic function boundary tracing toolkit for Windows NT/x86 kernel/user mode code.

Contact Johannes: jpassing (at) hotmail com
