Runtime code modification, or self-modifying code as it is often called, has been used for decades: to implement JITters, to write highly optimized algorithms, or to do all kinds of other interesting stuff. Using runtime code modification has never been really easy. It requires a solid understanding of machine code, and it is easy to screw up. What’s not so well known, however, is that writing such code has actually become harder over the last years, at least on the IA-32 platform: comparing the 486 with the current Core architectures, it becomes obvious that Intel, in order to allow more advanced CPU-internal optimizations, has actually weakened certain guarantees made by the CPU, which in turn requires the programmer to pay more attention to certain details.
Looking around on the web, there are plenty of code snippets and example projects that make use of self-modifying code. Without pointing fingers at specific resources, it is, however, safe to assume that a significant (and I mean significant!) fraction of these examples fail to address all potential problems related to runtime code modification. As I showed a while ago, even Detours, which is a well-done, widely recognized and widely used library relying on runtime code modification, has its issues:
- Dangerous Detours, Part 1: Introduction
- Dangerous Detours, Part 2: Unexpected Behaviour
- Dangerous Detours, Part 3: Messing execution flow
- Dangerous Detours, Part 4: Undetouring
Adopting the nomenclature suggested by the Intel processor manuals, code writing data to memory with the intent of having the same processor execute this data as code is referred to as self-modifying code. On SMP machines, it is possible for one processor to write data to memory with the intent of having a different processor execute this data as code. This process is referred to as cross-modifying code. I will jointly refer to both practices as runtime code modification.
The easiest part of runtime code modification is dealing with the memory model. In order to implement self-modifying or cross-modifying code, a program must be able to address the regions of memory containing the code to be modified. Moreover, due to memory protection mechanisms, overwriting code may not be trivially possible.
The IA-32 architecture offers three memory models: the flat, the segmented and the real mode memory model. Current operating systems such as Windows and Linux rely on the flat memory model, so I will ignore the other two.
Whenever the CPU fetches code, it addresses memory relative to the segment mapped by the CS segment register. In the flat memory model, the CS segment register, which refers to the current code segment, is always set up to map to linear address 0. In the same manner, the data and stack segment registers (DS, SS) are set up to refer to linear address 0.
It is worth mentioning that AMD64 has largely retired the use of segmentation in 64-bit mode, so the segment bases for the code and data segments are always treated as 0.
Given this setup, code can be accessed and modified on IA-32 as well as on AMD64 in the same manner as data. Easy-peasy.
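To make this concrete, here is a minimal sketch in C that treats the machine code of a function as plain data and copies its first bytes into a buffer. The names `sample` and `read_code_bytes` are made up for illustration. The sketch assumes a platform where a function pointer can be converted to a data pointer and where code pages are readable, which is the case for typical Windows and Linux builds on IA-32 and AMD64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical function whose machine code we want to look at. */
int sample(int x)
{
    return x + 1;
}

/* Copies the first n bytes of sample's machine code into out.
 * Assumption: n is small enough to stay within mapped code. */
void read_code_bytes(unsigned char *out, size_t n)
{
    assert(out != NULL);

    /* In the flat memory model, the address of the function is an
     * ordinary linear address that can also be used for data reads. */
    const unsigned char *code = (const unsigned char *)(uintptr_t)&sample;
    memcpy(out, code, n);
}
```

Reading works out of the box. Writing through the same pointer is a different story, because of the memory protection discussed next.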
One of the features enabled by the use of paging is the ability to enforce memory protection. Each page can specify restrictions on which operations are allowed to be performed on the memory of the respective page.
In the context of runtime code modification, memory protection is of special importance as memory containing code usually does not permit write access, but rather read and execute access only. A prospective solution thus has to provide a means to either circumvent such write protection or to temporarily grant write access to the required memory areas.
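The second option, temporarily granting write access, could look roughly like the following sketch. `patch_code` and `demo_patch` are hypothetical helpers, not part of any library. On Windows the sketch uses VirtualProtect and FlushInstructionCache, elsewhere it falls back to POSIX mprotect. It assumes that the patched region was read+execute before the patch, and it deliberately ignores the concurrency problems of other threads executing the code while it is being rewritten.

```c
#define _DEFAULT_SOURCE 1
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#include <unistd.h>
#endif

/* Hypothetical helper: overwrite len bytes at target, which lives in a
 * read+execute region. Returns 0 on success, -1 on failure. */
int patch_code(void *target, const void *bytes, size_t len)
{
    assert(target != NULL && bytes != NULL);
#ifdef _WIN32
    DWORD old_protect, ignored;
    if (!VirtualProtect(target, len, PAGE_EXECUTE_READWRITE, &old_protect))
        return -1;
    memcpy(target, bytes, len);
    /* Restore the previous protection, then discard stale instructions. */
    VirtualProtect(target, len, old_protect, &ignored);
    FlushInstructionCache(GetCurrentProcess(), target, len);
    return 0;
#else
    /* mprotect works on whole pages, so round the start address down. */
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)target & ~(page - 1);
    size_t span = (size_t)(((uintptr_t)target + len) - start);
    if (mprotect((void *)start, span, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return -1;
    memcpy(target, bytes, len);
    /* Assumption: the region was read+execute before the patch. */
    return mprotect((void *)start, span, PROT_READ | PROT_EXEC);
#endif
}

/* Demo: build a read+execute page, patch four bytes in it, verify. */
int demo_patch(void)
{
    static const unsigned char nops[4] = { 0x90, 0x90, 0x90, 0x90 };
    const size_t size = 4096;
    unsigned char *page;
    int ok;
#ifdef _WIN32
    DWORD old;
    page = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (page == NULL)
        return -1;
    memset(page, 0xCC, size);
    if (!VirtualProtect(page, size, PAGE_EXECUTE_READ, &old))
        return -1;
#else
    page = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    memset(page, 0xCC, size);
    if (mprotect(page, size, PROT_READ | PROT_EXEC) != 0)
        return -1;
#endif
    ok = patch_code(page + 16, nops, sizeof nops) == 0
        && memcmp(page + 16, nops, sizeof nops) == 0
        && page[15] == 0xCC;
#ifdef _WIN32
    VirtualFree(page, 0, MEM_RELEASE);
#else
    munmap(page, size);
#endif
    return ok ? 0 : -1;
}
```

Restoring the old protection afterwards is a design choice: it keeps the window in which code pages are writable as short as possible.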
As other parts of the image are write-protected as well, memory protection equally applies to approaches that modify non-code parts of the image such as the Import Address Table. That’s why a call to VirtualProtect is necessary when patching the IAT.
Programs using runtime code modification often do not restrict themselves to changing existing code but rather generate additional code. Assuming Data Execution Prevention has been enabled, it is thus vital for such approaches that any generated code is placed into memory regions that grant execute access. While user mode implementations can rely on a feature of the RTL heap (i.e. passing the HEAP_CREATE_ENABLE_EXECUTE flag when calling RtlCreateHeap) for allocating executable memory, no comparable facility exists for kernel mode, so a potential instrumentation solution has to come up with a custom allocation strategy.
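For user mode, placing generated code into executable memory might be sketched as follows. Note that this sketch does not use the executable RTL heap mentioned above. Instead it swaps in page-level allocation (VirtualAlloc on Windows, mmap elsewhere), writes the code while the memory is still read+write, and only then flips it to read+execute. `alloc_executable` is a hypothetical name.

```c
#define _DEFAULT_SOURCE 1
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

/* Hypothetical helper: copy len bytes of generated code into freshly
 * allocated memory and make that memory executable.
 * Returns NULL on failure. */
void *alloc_executable(const void *code, size_t len)
{
    assert(code != NULL && len > 0);
#ifdef _WIN32
    DWORD old;
    void *mem = VirtualAlloc(NULL, len, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE);
    if (mem == NULL)
        return NULL;
    memcpy(mem, code, len);
    /* Drop write access and grant execute access in one step. */
    if (!VirtualProtect(mem, len, PAGE_EXECUTE_READ, &old)) {
        VirtualFree(mem, 0, MEM_RELEASE);
        return NULL;
    }
    FlushInstructionCache(GetCurrentProcess(), mem, len);
    return mem;
#else
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    memcpy(mem, code, len);
    /* Drop write access and grant execute access in one step. */
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return NULL;
    }
    return mem;
#endif
}
```

On an x86 machine, passing the six bytes B8 2A 00 00 00 C3 (mov eax, 42 followed by ret) and calling the returned pointer as an `int (*)(void)` should yield 42. Where the allocation ends up in the address space is left to the operating system here, which matters for the branching issue discussed next.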
Whenever code is being generated, odds are that there are branching instructions involved. Depending on where memory for the new code has been allocated and where the branch target falls, the offset between the branching instruction itself and the jump target may be of significant size. In such cases, the software has to make sure that the branch instruction chosen does in fact support offsets at least as large as required for the individual purpose. This sounds trivial, but it is not: software that overwrites existing code with a branch may face severe limitations w.r.t. how many bytes the branch instruction may occupy. If, for example, there are fewer than 5 bytes of space (assuming IA-32), a near jump with a 32-bit displacement cannot be used. To use a 2-byte short jump instead, however, the newly allocated code had better be really near: within -128 to +127 bytes of the end of the jump instruction.
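The choice between the two encodings can be sketched as follows. `encode_jmp` is a hypothetical helper that emits a 2-byte short jump (opcode EB, 8-bit displacement) if the target is close enough, and otherwise a 5-byte near jump (opcode E9, 32-bit displacement) if there is room for it. The sketch treats the 32-bit displacement as a signed value, which is the rule in 64-bit mode and a conservative assumption on IA-32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a jump located at address `from` that
 * targets address `to`, using at most `space` bytes of buf.
 * Returns the number of bytes emitted (2 or 5), or 0 if no suitable
 * encoding fits. */
size_t encode_jmp(uint8_t *buf, size_t space, uint64_t from, uint64_t to)
{
    assert(buf != NULL);

    /* Displacements are relative to the end of the jump instruction. */
    int64_t rel8 = (int64_t)(to - (from + 2));
    if (space >= 2 && rel8 >= INT8_MIN && rel8 <= INT8_MAX) {
        buf[0] = 0xEB;                 /* JMP rel8 */
        buf[1] = (uint8_t)rel8;
        return 2;
    }

    int64_t rel32 = (int64_t)(to - (from + 5));
    if (space >= 5 && rel32 >= INT32_MIN && rel32 <= INT32_MAX) {
        uint32_t d = (uint32_t)rel32;
        buf[0] = 0xE9;                 /* JMP rel32 */
        for (int i = 0; i < 4; i++)    /* displacement is little-endian */
            buf[1 + i] = (uint8_t)(d >> (8 * i));
        return 5;
    }

    return 0;                          /* target out of reach */
}
```

For example, a jump at 0x1000 to 0x1010 fits into the short form EB 0E, whereas a jump at 0x401000 to 0x10001000 needs the 5-byte form E9 FB FF BF 0F. With only 4 bytes of space available, the second call returns 0, which is exactly the situation described above.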
Further safety concerns will be discussed in Part 2 of this series of posts.