Runtime Code Modification Explained, Part 2: Cache Coherency Issues
Instrumentation of a routine may comprise multiple steps. As an example, a trampoline may need to be generated or updated, followed by a modification on the original routine, which may include updatating or replacing a branch instruction to point to the trampoline.
In such cases, it is essential for maintaining consistency that the code changes take effect in a specific order. Otherwise, if the branch was written before the trampoline code has been stored, the branch would temporarily point to uninitialized memory. If multiple CPUs were involved and code became subject to execution while in such an inconsistent state, undefined execution behaviour would occur.
The order in which a program specifies memory loads and stores to be conducted is referred to as program order. On processors such as the Intel 386, this order is preserved. Contemporary processors, however, implement significantly weaker memory models. In order to speed up execution, these processors allow certain memory operations to be conducted out of order. Such reordering may, in certain situations like the one depicted before, lead to wrong results or to windows of inconsistency. To avoid such situations from occuring, the program must explicitly prohibit certain reorderings to be performed, which can be done by using memory fences.
Respecting the memory model implemented by the processor is thus crucial in order to achieve safe operation. Although both read and store operations are subject to potential reordering, only reordering of store operations is of interest in the context of the example depicted above. However, the memory model and the memory order enforced by the various CPUs addressed in this chapter differs significantly.
IA-32 and Intel 64 implement a rather strong memory model. In particular, memory stores are always carried out in program order – this holds true for both uniprocessor and multiprocessor systems. For the situation depicted above, this means that storing the updated branch target is not conducted before all stores of writing the trampoline have completed.
SPARC V9 offers a choice between three different memory models, which differ in their guarantees they provide: Total Store Order (TSO), Partial Store Order (PSO), and Relaxed Memory Order (RMO) TSO, the strongest memory model among these three, guarantees presenvation of the order of store operations. As such, no memory fences are required. Both RMO, the weakest memory model, and PSO do not provide such guarantees. That is, to ensure that the second store is not carried out before the first store has completed, an appropriate memory fence instruction, i.e. a MEMBAR #StoreStore instruction has to be executed between the two stores.
The memory model implemented by IA-64 also allows stores to be conducted out of order. This can be prevented either by a memory fence or by specifying the second store to have release semantics: By using the st.rel instruction rather than st, the processor is indicated that this instruction must not take effect until all prior orderable instructions, which includes the first store, have taken effect.
Instruction Cache/Data Store Incoherencies
As mentioned before, many modern microprocessors, including contemporary IA-32, SPARC and IA-64 CPUs use dedicated instruction caches. Whether these instruction caches are kept coherent with data caches depends on the architecture. Again, IA-32 and Intel 64 are more forgiving than other CPUs in this regard and keep instruction and data caches coherent – no manual intervention for flushing the instruction cache is required (As a consquence of that, NtFlushInstructionCache is essentially a noop on these architectures).
SPARC does not maintain this coherency automatically. To have instruction changes take effect immediately, SPARC requires the developer to issue a FLUSH instruction for each modified machine word of instructions. In a similar manner, a fc.i instruction is required on IA-64 to flush the respective instructions from the instruction cache.
Pipeline/Instruction Cache Incoherencies
Another issue that may occur when writing self- or cross-modifying code is the processor’s pipeline to become incoherent with the instruction cache. That is, although the instruction cache contains updated instructions, the CPU may continue working with outdated instructions for a while.
According to the SPARC processor manual, SPARC is not exposed to this problem and flushing the instruction cache is sufficient to avoid this problem. On IA-64, however, this incoherency can occur – whether this is in fact a problem or not depends on the individual usage scenario. However, to force having the instruction flush take effect immediately and to synchronize the instruction cache with the instruction fetch stream, issueing a sync.i instruction is required.
On IA-32, the exact behavior in case of runtime code modification in general and such incoherencies in particular slightly varies among different models. On the one hand, guarantees concerning safety of such practices have been lessened over the evolvement from the Intel 486 series to the current Core 2 series. On the other hand, certain steppings of CPU models have been explicitly documented to exhibit defective behavior in this regard. Although detailed technical information on this topic is not available, these problems seem to be possible to emerge when code is being modified that is currently in the state of execution.
Due to this variance, the exact range of issues that can arise due to runtime code modification and cache inconherencies is not clear and appropriate countermeasures cannot be easily identified. The route chosen by the Intel documentation is thus to specify an algorithm that is guaranteed to work across all processor models – although for some processors, it might be more restricting than necessary.
For cross-modifying code, the suggested algorithm makes use of serializing instructions. The role of these instructions, cpuid being among them, is to force any modifications to registers, memory and flags to be completed and to drain all buffered writes to memory before the next instruction is fetched and executed.
It is worth pointing put that Intel 64 does not seem to be exposed to this issue. Moreover, as the AMD documents state:
Synchronization for crossmodifying code is not required for code that resides within the naturally aligned quadword.
Local/Remote Incoherencies
Finally, there may be dicrepancies between which code the local processor sees and which code other processors on a SMP systems see. That is, although the stores may have already taken effect on the local CPU, they may be delayed on other CPUs so that these CPUs may continue working with the old instructions for some amount of time.
In many cases, as long as ordering is preserved, delaying is not a major problem, However, when the changes should be enforced to take effect immediately, further steps are required.
Intel 64 specifies that
Stores from a processor appear to be committed to the memory system in program order; however, stores can be delayed arbitrarily by store buffering while the processor continues operation.
As a consequence, an MFENCE instruction should be executed as soon as the code patch has been written. In a similar way, IA-64 requires an mf (memory fence) instruction to be issued after the fc.i and sync.i if changes are to take effect immediately on remote CPUs.