Trace and Watch Data -- How does it work

One of the builtin WinDBG commands is wt (Trace and Watch Data), which can be used to trace the execution flow of a function. Given source code like the following:

    void foo()
    {
    }
    
    void bar()
    {
    }
    
    int main()
    {
      // Some random code...
      int a = 1, b = 2;
      
      // Call a child function.
      foo();
      
      // More useless code...
      a+=b;
      if ( a == b) a = b;
      
      // Call another child function.
      bar();  
      
      return 0;
    }

wt will produce the following output:

    0:000wt
    Tracing test!main to return address 00401291
        6     0 [  0] test!main
        1     0 [  1]   test!ILT+5(_foo)
        4     0 [  1]   test!foo
       13     5 [  0] test!main
        1     0 [  1]   test!ILT+0(_bar)
        4     0 [  1]   test!bar
       17    10 [  0] test!main
    
    27 instructions were executed in 26 events 
                                      (0 from other threads)
    
    Function Name         Invocations MinInst MaxInst AvgInst
    test!ILT+0(_bar)                1       1       1       1
    test!ILT+5(_foo)                1       1       1       1
    test!bar                        1       4       4       4
    test!foo                        1       4       4       4
    test!main                       1      17      17      17
    
    0 system calls were executed

Although helpful, tracing a larger function calling a multitude of other functions slows down the debuggee significantly. An interesting question is thus how wt is implemented. Three possible implementation strategies come to mind:

Use single-stepping. After each instruction executed, a debug trap is raised and the debugger is delivered a single-step debugging event. Though all non-branching instructions are probably irrelevant to wt, by intercepting each call and ret instruction, the debugger is able to trace function entry and exit.
Explicitly set breakpoints. The debugger disassembles the function to be traced and places an ordinary breakpoint on each call instruction as well on as the return address of the function. Whenever one of the call-breakpoints fires, the debugger instruments the target function in the same way (i.e. place breakpoints on each call instruction as well as the return address) and continues execution (without single-stepping). By intercepting all function calls and returns, the debugger is able to deduce the call tree. This approach would be similar to UMSS.
Use Last Branch Recording. This is a rather new additon to the IA-32 instruction set that allows setting breakpoints on taken branches, interrupts, and exceptions, and to single-step from one branch to the next.

In order to find out, we have to debug the debugger to observe how it debugs the target. We thus start WinDBG, choose our test application as target and let it break on main. We then start another WinDBG instance and attach it to the first WinDBG instance. In order to find out which debugging events are consumed by the first instance, we use the second debugger to trace function calls made by the first debugger.

All usermode debuggers eventually end up calling ntdll!NtWaitForDebugEvent in a loop – so to find out which debugging events are consumed, all we need to do is trace all calls to this function. While being an undocumented native function, there is an excellent summary on the inner workings of user mode debugging which also covers ntdll!NtWaitForDebugEvent. Given this information, all we need to do to check whether strategy #1 or strategy #2 has been implemented (I assume #3 may safely be neglected) is to put together a little breakpoint command like the following (line breaks added for clarity):

bp ntdll!NtWaitForDebugEvent "
   r @$t1=poi(esp+10); 
   g @$ra; 
   .if (poi(@$t1)==8) {.echo \"SingleStep\n\" } 
   .else {.printf \"Excp %p\\n\", poi(@$t1+c)};
   g "

When entering ntdll!NtWaitForDebugEvent, we store the address of the fourth parameter (which receives a PDBGUI_WAIT_STATE_CHANGE structure) in $t1 and step out of the function. Then we reach into the structure whose address is stored in $t1 and check if the first field marks the event of being of type DbgSingleStepStateChange (0x8) and output an appropriate message. If we receive about 30 single-step events, strategy #1 has probably been chosen. For #2 we would expect to receive 5 breakpoint events.

Back to the first debugger, we now opt to trace the main function by running wt. This yields the output shown above. Switching to the second debugger again, we now see the following output:

    SingleStep
    SingleStep
    SingleStep
    
    [...about 20 more...]
    
    SingleStep
    SingleStep
    SingleStep
    SingleStep
    SingleStep

Quite obviously, wt implements strategy #1 – it does single stepping. Although this does not really come as a surprise, it is still unfortunate as it is most likely the slowest approach of tracing calls. And as anybody who has ever used wt can probably confirm, wt is really slow.

As an interesting side note, as of Linux kernel 2.6.25, ptrace on x86 has been enhanced to facilitate Last Branch Recording on CPUs that support it.

Any opinions expressed on this blog are Johannes' own. Refer to the respective vendor’s product documentation for authoritative information.

« Back to home

Trace and Watch Data -- How does it work

Related posts