Concurrent Execution

A typical user-mode process on a Windows system can be expected to have more than one thread. In addition to user threads, the Windows kernel employs a number of system threads. Given the presence of multiple threads, any code modification is likely to affect more than one thread; that is, sooner or later more than one thread is going to execute the modified code sequence.
Modifying existing code is a technique commonly used by instrumentation solutions such as DTrace. On a multiprocessor machine, altering code raises the challenge of properly synchronizing such activity among processors. As stated before, IA-32/Intel64 allows code to be modified in the same manner as data. Whether modifying data is an atomic operation depends on the size of the operand: if the total number of bytes to be modified does not exceed 8 and the target address meets certain alignment requirements (for example, a 4-byte write to a 4-byte-aligned address, or an 8-byte write to an 8-byte-aligned address), current IA-32 processors guarantee that the write is atomic.
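As a minimal sketch of what the size and alignment condition means in practice, the following C helpers check whether a byte range lies within a single 8-byte-aligned quadword and, if so, write the new bytes with one 64-bit compare-and-swap on that quadword. The sketch assumes GCC/Clang `__atomic` builtins and operates on an ordinary writable buffer rather than real code pages; `fits_in_aligned_qword` and `patch_bytes_atomically` are hypothetical names. It only addresses the atomicity of the store itself, not making pages writable or making other processors fetch the new instructions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if [addr, addr + len) lies entirely within one
 * 8-byte-aligned quadword, so a single aligned 64-bit store can cover it. */
static bool fits_in_aligned_qword(uintptr_t addr, size_t len)
{
    if (len == 0 || len > 8)
        return false;
    return (addr & ~(uintptr_t)7) == ((addr + len - 1) & ~(uintptr_t)7);
}

/* Hypothetical helper: overwrite len bytes at dst with a single atomic
 * 64-bit compare-and-swap on the enclosing aligned quadword.
 * Returns false if the range straddles a quadword boundary. */
static bool patch_bytes_atomically(uint8_t *dst, const uint8_t *src, size_t len)
{
    uintptr_t addr = (uintptr_t)dst;
    if (!fits_in_aligned_qword(addr, len))
        return false;

    uint64_t *qword = (uint64_t *)(addr & ~(uintptr_t)7);
    size_t offset = addr & 7;

    uint64_t expected = __atomic_load_n(qword, __ATOMIC_ACQUIRE);
    uint64_t desired;
    do {
        /* Merge the new bytes into a local copy of the quadword; memcpy keeps
         * byte positions independent of the host's byte order. */
        desired = expected;
        memcpy((uint8_t *)&desired + offset, src, len);
        /* On failure, expected is refreshed with the current value and the
         * merge is redone. */
    } while (!__atomic_compare_exchange_n(qword, &expected, desired, false,
                                          __ATOMIC_SEQ_CST, __ATOMIC_ACQUIRE));
    return true;
}
```

For a 5-byte patch, the check succeeds when the address's offset within its quadword is 3 or less; at offset 4 or more the range straddles two quadwords and no single aligned store can cover it.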
Instrumenting a routine may comprise multiple steps. For example, a trampoline may need to be generated or updated, followed by a modification of the original routine, which may involve updating or replacing a branch instruction so that it points to the trampoline. In such cases, maintaining consistency requires that the code changes take effect in a specific order. Otherwise, if the branch were written before the trampoline code had been stored, the branch would temporarily point to uninitialized memory.
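The required ordering can be sketched in C as follows. The sketch assumes GCC/Clang `__atomic` builtins, plain data buffers standing in for the routine and the trampoline (no page protection changes), and a patch site that is 8-byte aligned. `encode_jmp_rel32` and `install_detour` are hypothetical helpers: the trampoline is written first, and only then is the 5-byte near `JMP rel32` (opcode `E9`) published with a single 64-bit store carrying release semantics.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode a near JMP rel32 located at `site` that jumps
 * to `target`. The displacement is relative to the next instruction,
 * i.e. site + 5. Returns false if target is out of rel32 range. */
static bool encode_jmp_rel32(uintptr_t site, uintptr_t target, uint8_t out[5])
{
    intptr_t diff = (intptr_t)target - (intptr_t)(site + 5);
    if (diff < INT32_MIN || diff > INT32_MAX)
        return false;
    uint32_t rel = (uint32_t)(int32_t)diff;
    out[0] = 0xE9;
    for (int i = 0; i < 4; i++)
        out[1 + i] = (uint8_t)(rel >> (8 * i)); /* little-endian displacement */
    return true;
}

/* Hypothetical helper: install a detour at an 8-byte-aligned `site`.
 * The caller must ensure `trampoline` can hold tramp_len bytes. */
static bool install_detour(uint8_t *site, uint8_t *trampoline,
                           const uint8_t *tramp_code, size_t tramp_len)
{
    uint8_t jmp[5];
    if (((uintptr_t)site & 7) != 0)
        return false;
    if (!encode_jmp_rel32((uintptr_t)site, (uintptr_t)trampoline, jmp))
        return false;

    /* Step 1: generate the trampoline before anything points at it. */
    memcpy(trampoline, tramp_code, tramp_len);

    /* Step 2: build the new quadword: 5 branch bytes followed by the
     * 3 original bytes that the branch does not cover. */
    uint64_t *qword = (uint64_t *)site;
    uint64_t value = __atomic_load_n(qword, __ATOMIC_RELAXED);
    memcpy(&value, jmp, sizeof jmp);

    /* Release store: the trampoline writes above cannot be moved after this
     * store, so a thread that observes the branch with acquire semantics
     * also observes the finished trampoline. */
    __atomic_store_n(qword, value, __ATOMIC_RELEASE);
    return true;
}
```

On IA-32/Intel64, stores are not reordered with earlier stores, so the release store compiles to an ordinary MOV; the ordering annotation mainly stops the compiler from moving the trampoline writes past the branch. It does not by itself guarantee that another processor executing the patched routine fetches the new bytes; Intel documents a separate procedure for cross-modifying code.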
Runtime code modification, or self-modifying code as it is often referred to, has been used for decades: to implement JIT compilers, to write highly optimized algorithms, or to do all kinds of other interesting things. Writing such code has never been particularly easy; it requires a solid understanding of machine code, and it is easy to screw up. What is not so well known, however, is that writing such code has actually become harder over the last few years, at least on the IA-32 platform. A comparison of the 486 with current Core architectures shows that Intel, in order to allow more advanced CPU-internal optimizations, has actually weakened certain guarantees made by the CPU, which in turn requires the programmer to pay more attention to certain details.
Browsing through ACM Queue’s archives, I came across the article Real-World Concurrency by Bryan Cantrill (who, by the way, is one of the creators of DTrace) and Jeff Bonwick (Issue 5/2008). The article provides a nice summary of the practical challenges and best practices of systems programming in a multithreaded, shared-memory environment. Worth reading.