Runtime Code Modification Explained, Part 3: Cross-Modifying Code and Atomicity
Modifying existing code is a technique commonly used by instrumentation solutions such as DTrace. On a multiprocessor machine, altering code raises the challenge of properly synchronizing the modification among processors.
As stated before, IA-32/Intel64 allows code to be modified in the same manner as data. Whether modifying data is an atomic operation depends on the size and alignment of the operand. If the bytes to be modified fit into a single operand of at most 8 bytes and the target address adheres to certain alignment requirements, current IA-32 processors guarantee atomicity of the write operation.
If any of these requirements does not hold, multiple write instructions have to be performed, which is an inherently non-atomic process. What is often ignored, however, is that even where atomic writes or bus locking (i.e. using the lock prefix) on IA-32 or AMD64 would be feasible, relying on them would not necessarily be safe, as instruction fetches are allowed to pass locked instructions. Quoting section 7.1.2.2 of the Intel manual:
Locked operations are atomic w.r.t. all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.
Although appealing, relying merely on the atomicity of store operations must therefore be assumed insufficient for safe operation in many cases.
The exact behavior in case of runtime code modifications also varies slightly among CPU models. On the one hand, guarantees concerning the safety of such practices have, as indicated before, been weakened over the evolution from the Intel 486 series to the current Core 2 series. On the other hand, certain steppings of some CPU models even exhibit defective behavior in this regard, as explained in several Intel errata, including one for the Pentium III Xeon.
Due to this variance, the exact range of issues that can arise when performing code modifications is not clear, and appropriate countermeasures cannot be easily identified. As described in these errata, cross-modifying code that does not adhere to certain coding practices, described later, can lead to “unexpected execution behavior”, which may include the generation of exceptions.
The route chosen by the Intel documentation is thus to specify an algorithm that is guaranteed to work across all processor models, although for some processors it might be more restrictive than necessary.
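To make the size and alignment condition discussed above a bit more concrete, here is a minimal sketch in C. It assumes the conservative rule that a store of 1, 2, 4 or 8 bytes to a naturally aligned address is atomic. The names fits_single_aligned_store and store_patch_4 are made up for illustration. Keep in mind that, for the reasons just given, passing this check only makes the write itself atomic; it does not make patching code that another processor may be fetching safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if `len` bytes at `addr` can be written with
 * one naturally aligned store (conservative reading of the atomicity rule). */
static bool fits_single_aligned_store(uintptr_t addr, size_t len)
{
    if (len != 1 && len != 2 && len != 4 && len != 8)
        return false;               /* no single store of this width */
    return (addr & (len - 1)) == 0; /* natural alignment */
}

/* Hypothetical helper: writes a 4-byte patch with a single store.
 * The caller is expected to have checked eligibility beforehand. */
static void store_patch_4(volatile uint32_t *target, uint32_t value)
{
    assert(fits_single_aligned_store((uintptr_t)target, sizeof value));
    *target = value;
}
```

A 5-byte patch, or a 4-byte patch at an address such as 0x1002, fails the check and would have to be written with several stores.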
For cross-modifying code, the suggested algorithm makes use of serializing instructions. The role of these instructions, cpuid being one of them, is to force any modifications to registers, memory and flags to be completed and to drain all buffered writes to memory before the next instruction is fetched and executed.
Quoting the algorithm defined in the Intel manual:
(* Action of Modifying Processor *)
Memory_Flag <- 0; (* Set Memory_Flag to value other than 1 *)
Store modified code (as data) into code segment;
Memory_Flag <- 1;

(* Action of Executing Processor *)
WHILE (Memory_Flag != 1)
    Wait for code to update;
ELIHW;
Execute serializing instruction;
Begin executing modified code;
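The pseudocode above might be transcribed to C roughly as follows. This is only a sketch under several assumptions: it uses C11 atomics for Memory_Flag, GCC/Clang inline assembly for cpuid, and treats the code region as a plain byte buffer. The names memory_flag, serialize_cpu, publish_patch and wait_for_patch are mine, not Intel's. Making the region writable and steering the executing processor into the wait loop before the modification starts are left out, just as in the original.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shared flag from the pseudocode; 1 means "modified code is in place". */
static atomic_int memory_flag;

/* Executes cpuid purely for its serializing side effect. */
static void serialize_cpu(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx;
    (void)edx;
#else
    /* Placeholder so the sketch compiles elsewhere;
     * this is NOT a serializing instruction. */
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Modifying processor: store the new bytes as data, then raise the flag. */
static void publish_patch(uint8_t *code, const uint8_t *patch, size_t len)
{
    assert(code != NULL && patch != NULL);
    atomic_store_explicit(&memory_flag, 0, memory_order_relaxed);
    memcpy(code, patch, len);
    /* Release ordering makes the new bytes visible before the flag. */
    atomic_store_explicit(&memory_flag, 1, memory_order_release);
}

/* Executing processor: spin until the flag is raised, then serialize. */
static void wait_for_patch(void)
{
    while (atomic_load_explicit(&memory_flag, memory_order_acquire) != 1)
        ; /* Wait for code to update */
    serialize_cpu();
    /* The caller may now begin executing the modified code. */
}
```

One thread would call publish_patch while every other thread that could reach the patched region sits in wait_for_patch. The release/acquire pair only orders the data writes; the serializing instruction is what discards instructions the executing processor may already have fetched.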
To further complicate matters, the IA-32 architecture, as you know, uses a variable-length instruction set. As a consequence, additional problems not yet addressed may occur if the lengths of the old and the new instruction do not match. Two situations may occur:
1. The new instruction is longer than the old instruction. In this case, more than one instruction has to be modified. Modifications straddling instruction boundaries, however, are exposed to an extended set of issues that will be covered in my next post.
2. The new instruction is shorter than the old instruction. The ramifications of this situation depend on the nature of the new instruction. If, for instance, the instruction is an unconditional branch instruction, the subsequent pad bytes will never be executed and can be neglected. If, on the other hand, execution may be resumed at the instruction following the new instruction, the pad bytes must constitute valid instructions. For this purpose, a _sled_ consisting of nop instructions can be used to fill the pad bytes.
The algorithm defined by Intel for cross-modifying code ensures that neither the old nor the new instruction is being executed while the modification is still in progress. Therefore, when employing this algorithm, replacing a single instruction by more than one instruction can be considered as safe as replacing an instruction by an equally sized instruction.
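Filling the pad bytes with a nop sled for the shorter-instruction case can be sketched like this. The helper name build_padded_patch is hypothetical, and the sketch only assembles the replacement bytes in a scratch buffer; writing them into the live code would still have to follow the cross-modification algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X86_NOP 0x90 /* single-byte nop opcode */

/* Hypothetical helper: builds old_len replacement bytes consisting of the
 * new instruction followed by nops, so that execution falling through the
 * new instruction only ever decodes valid instructions. */
static void build_padded_patch(uint8_t *out, size_t old_len,
                               const uint8_t *new_insn, size_t new_len)
{
    assert(new_len <= old_len); /* the longer case needs different handling */
    memcpy(out, new_insn, new_len);
    memset(out + new_len, X86_NOP, old_len - new_len);
}
```

Replacing a 5-byte instruction with the 2-byte xor eax, eax (31 C0), for example, yields 31 C0 90 90 90.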
It is worth noting that regardless of which situation applies to instrumentation, the complementary situation will apply to uninstrumentation.